Subset by increasing count over time in R

I have a data frame, called sales_by_month, that contains counts of sales by seller, year, and month:
library(tidyverse)
sales_by_month <- tribble(
  ~Seller,      ~Year, ~Month, ~Sales,
  "John Doe",    2018,      1,     82,
  "John Doe",    2018,      2,     72,
  "John Doe",    2018,      3,     42,
  "Sally Jane",  2018,      1,     25,
  "Sally Jane",  2018,      2,     77)
I need to subset this dataset by only those sellers where their sales are increasing over time, and I cannot figure out how to do it.
The resulting subset should contain:
Seller Year Month Sales
Sally Jane 2018 1 25
Sally Jane 2018 2 77
Because Sally's sales are increasing, while John's sales are decreasing.
Any help would be very much appreciated!!

First, make a variable representing the month-to-month difference in Sales (I name it dif). If dif < 0 anywhere within a seller's rows, decreasing values exist in that seller's sales.
library(dplyr)
sales_by_month %>%
  arrange(Seller, Year, Month) %>%
  group_by(Seller) %>%
  mutate(dif = c(0, diff(Sales))) %>%
  filter(all(dif >= 0)) %>%
  select(-dif) # drop dif
#   Seller      Year Month Sales
#   <chr>      <dbl> <dbl> <dbl>
# 1 Sally Jane  2018     1    25
# 2 Sally Jane  2018     2    77
More concisely:
sales_by_month %>%
  group_by(Seller) %>%
  arrange(Seller, Year, Month) %>%
  filter(all(c(0, diff(Sales)) >= 0))
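Note that all(c(0, diff(Sales)) >= 0) keeps sellers whose sales never decrease, so ties count as increasing. If you want strictly increasing sales instead, a small variation (my tweak, not part of the answer above) drops the padded 0 and requires every difference to be positive:
sales_by_month %>%
  group_by(Seller) %>%
  arrange(Seller, Year, Month) %>%
  filter(all(diff(Sales) > 0)) # diff() is length 0 for single-row sellers, so they are kept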

How to do this really depends on how you define "increasing over time." One way is to require a month-over-month increase. My solution just looks at whether the last month had an increase, but it could be changed to look at this in other ways:
1. Calculate the change every month, filter for just the last month, keep sellers whose change was positive, and pull out the unique Seller names.
2. Filter the data for the seller names obtained in step 1.
The code below does this, starting from a data frame we can load straight into R:
library(tidyverse)
sales_by_month <- tribble(
  ~Seller,      ~Year, ~Month, ~Sales,
  "John Doe",    2018,      1,     82,
  "John Doe",    2018,      2,     72,
  "John Doe",    2018,      3,     42,
  "Sally Jane",  2018,      1,     25,
  "Sally Jane",  2018,      2,     77)
increased_from_last_month <- sales_by_month %>%
  group_by(Seller) %>%
  arrange(Seller, Year, Month) %>%
  mutate(change = Sales - lag(Sales, default = 0)) %>%
  summarise_all(last) %>%
  filter(change > 0) %>%
  pull(Seller) %>%
  unique()
sales_by_month %>%
  filter(Seller %in% increased_from_last_month)
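Side note: summarise_all() is superseded in current dplyr; the same step can be written with across() (a cosmetic variation, same result):
increased_from_last_month <- sales_by_month %>%
  group_by(Seller) %>%
  arrange(Seller, Year, Month) %>%
  mutate(change = Sales - lag(Sales, default = 0)) %>%
  summarise(across(everything(), last)) %>%  # one row per Seller, so unique() is unnecessary
  filter(change > 0) %>%
  pull(Seller)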

Related

Easy method for checking for duplicates and errata in panel dataset

Imagine a dataframe:
df1 <- tibble::tribble(
  ~City,       ~Year, ~Burger_cost, ~Cola_cost, ~Resident_AVGGrowth_cm,
  "Abu Dhabi",  2020,            2,          3,                    175,
  "Abu Dhabi",  2019,            1,          3,                    174,
  "Abu Dhabi",  2018,            1,          2,                    173,
  "New York",   2020,            4,          5,                    500,
  "New York",   2019,            3,          5,                    184,
  "New York",   2018,            2,          3,                    183,
  "Abu Dhabi",  2020,            2,          3,                    175,
  "Abu Dhabi",  2019,            1,          3,                    174,
  "Abu Dhabi",  2018,            1,          2,                    173,
  "Abu Dhabi",  2017,            1,         NA,                    100,
  "London",     2020,            5,          6,                    186,
  "London",     2019,            4,          6,                    188,
  "London",     2018,            3,          5,                    185,
  "New York",   2020,            4,          5,                    185,
  "New York",   2019,            3,          5,                    184,
  "New York",   2018,            3,          3,                    183,
  "London",     2020,            5,          6,                    186,
  "London",     2019,            4,          6,                    188,
  "London",     2018,            3,          5,                    185)
There can be many issues in the inputs. For example, the data for London is identical across all years, so the duplicates can be deleted. The data for Abu Dhabi is duplicated for the years 2018-2020 but differs for 2017.
The data for New York contains a discrepancy in year 2018 for the burger cost (and in the resident growth).
The resident-growth value in row 4 for NY is surely erroneous (the discrepancy is too huge), while its duplicate in row 14 is plausible; in this case we would prefer to delete row 4 and keep row 14, based on the criterion that no person can be shorter than 50 cm or taller than 4 meters (400 cm), especially as a mean value :).
Rows 6 and 16 (NY, 2018) contain different values for the burger cost, which can only be resolved by a human (say, the 3 USD variant is correct, but R needs to show the discrepancy to the end user).
The question: can you suggest nice and neat methods for performing these operations? What do you usually use to solve such issues?
I have just started to think about this issue.
It would be an easy task in C#. I have several strategies in mind, but I would be extremely interested to know what approaches you use for solving such issues :) Any ideas are much appreciated :)
Perhaps this helps. Grouped by 'City', create a flag for outliers in 'Resident_AVGGrowth_cm' with boxplot(); then add 'Year' into the grouping and create a second flag for rows whose values are unique within the group, looping n_distinct over the columns Burger_cost to Resident_AVGGrowth_cm with if_all (which returns TRUE for a row only when all looped columns return TRUE). Finally, grouped by 'City' again, filter out those City groups that are entirely duplicates (e.g. London) and remove the rows with outlier_flag set.
library(dplyr)
df1 %>%
  group_by(City) %>%
  mutate(outlier_flag = Resident_AVGGrowth_cm %in%
           boxplot(Resident_AVGGrowth_cm, plot = FALSE)$out) %>%
  group_by(Year, .add = TRUE) %>%
  mutate(flag_all_unq = if_all(Burger_cost:Resident_AVGGrowth_cm,
                               ~ n_distinct(.x, na.rm = TRUE) == 1)) %>%
  group_by(City) %>%
  filter(!all(flag_all_unq)) %>%
  ungroup() %>%
  filter((!outlier_flag) | flag_all_unq)
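If you also want to surface the discrepancies that need a human decision (the NY 2018 burger cost), a minimal sketch, assuming the same df1 and dplyr loaded as above: drop exact duplicate rows first, then keep only the City/Year groups that still disagree:
df1 %>%
  distinct() %>%              # exact duplicate rows carry no information
  group_by(City, Year) %>%
  filter(n() > 1) %>%         # groups still holding >1 row disagree somewhere
  ungroup() %>%
  arrange(City, Year)
For the example data this returns the conflicting New York rows for 2018 (burger cost 2 vs 3) and 2020 (growth 500 vs 185), which can then be shown to the end user.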

Is there an easy way to extend a single column in an R dataframe?

I want to extend my dataset with missing observations in order to compute forecasts.
This means I want to extend my 'time' column and set all the new cells from the other columns to NA:
Time1 <- c(2019, 2020, 2021, 2022)
data1 <- c(3, 4, 1, 4)
df1 <- cbind(Time1, data1)
Time2 <- c(2019, 2020, 2021, 2022, 2023, 2024, 2025)
data2 <- c(3, 4, 1, 4, NA, NA, NA)
df2 <- cbind(Time2, data2)
Is there an easy way to get from df1 to df2 without creating a new dataframe?
You can do it like this (note that cbind() on vectors actually returns a matrix, so convert to a tibble first):
library(dplyr)
df1 <- as_tibble(df1)
df1 %>% add_row(Time1 = seq(from = 2023, to = 2025, by = 1))
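Alternatively, a sketch with tidyr (my suggestion, not part of the answer above): after the as_tibble() conversion, complete() pads Time1 out to the full range and fills data1 with NA in one step:
library(tidyr)
df1 %>% complete(Time1 = 2019:2025)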

Predict down years/rows based on previous years/rows, with lmer and dplyr

I have a linear mixed effects model that estimates change in grass based on both the previous year's grass and several environmental variables (and their interactions) at distinct sites over time.
Using this mixed effects model and established, projected future environmental variables, I want to predict change in grass density. Each year's prediction thus depends on the previous year's density, located on the row above it in my dataframe. We begin with a real value from the present year, and then predict into the future.
library(tidyverse); library(lme4)
# Data we have from the past, where each site has annual ChlA/Sal/Temp as well
# as grass density. Our model, change.mod, predicts grass.change based on these
# env variables AND last year's grass coverage (grass.y1).
ThePast = tibble(
  year = c(2017, 2018, 2019, 2020, 2021, 2017, 2018, 2019, 2020, 2021, 2017, 2018, 2019, 2020, 2021),
  site = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"),
  ChlA = c(50, 210, 190, 101, 45, 20, 20, 80, 5, 40, 25, 12, 11, 5, 20),
  Sal  = c(1, 4, 5, 0.1, 10, 18, 14, 17, 10, 21, 30, 28, 25, 20, 22),
  Temp = c(28, 21, 24, 25, 22, 19, 20, 17, 18, 15, 18, 16, 19, 20, 20),
  grass = c(.5, .3, .1, .4, .1, .25, .33, .43, .44, .08, .75, .54, .69, .4, .6)) %>%
  group_by(site) %>%
  mutate(grass.y1 = lag(grass, order_by = year)) %>%  # last year's grass
  mutate(grass.change = grass - grass.y1) %>%         # calculate change
  ungroup()
# the mixed-effects model
change.mod = lmer(grass.change ~ grass.y1 + log10(ChlA) + log10(Sal) +
                    grass.y1:log10(Temp) + grass.y1:log10(Sal) + (1 | site),
                  data = ThePast)
# Future environmental data per site per year, to be used to predict grass.
TheDistantFuture <- tibble(
  year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
  site = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
  ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
  Sal  = c(12, 11, 15, 16, 21, 32, 21, 21, 22),
  Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18))
# The final data frame should look like this, where both grass columns are
# predicted out into the future (we could include the grass.y1 column if we wanted):
PredictedFuture <- tibble(
  year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
  site = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
  ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
  Sal  = c(12, 11, 15, 16, 21, 32, 21, 21, 22),
  Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18),
  grass = c(0.237, 0.335, 0.457, 0.700, 0.151, 0.361, 0.176, 0.380, 0.684),
  grass.change = c(0.1368, 0.2550, -0.1425, -0.1669, -0.18368, -0.0962, 0.106, 0.229, 0.323))
Right now, I can generate the next year's (2022) correct predictions using group_by() and predict(), referencing last year's grass density with a lag function.
# How do we get to PredictedFuture?? Here is what I'm trying:
FutureIsNow = ThePast %>%
  filter(year == 2021) %>%          # take last year of real data as the baseline grass density
  bind_rows(TheDistantFuture) %>%   # bind future data
  arrange(site, year) %>%           # arrange by site then year
  group_by(site) %>%                # maybe this should be rowwise?
  mutate(grass.change = predict(change.mod, newdata = data.frame(
    grass.y1 = lag(grass, n = 1, order_by = year),
    ChlA = ChlA, Sal = Sal, Temp = Temp, site = site))) %>%  # this correctly predicts 2022 grass change
  mutate(grass = grass.change + lag(grass, n = 1))           # this also works to calculate grass in 2022
This df looks like this:
> FutureIsNow
# A tibble: 12 × 7
# Groups: site [3]
year site ChlA Sal Temp grass grass.change
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021 A 45 10 22 NA NA
2 2022 A 40 12 24 0.237 0.137
3 2023 A 95 16 28 NA NA
4 2024 A 149 21 31 NA NA
5 2021 B 40 21 15 NA NA
6 2022 B 200 11 22 0.335 0.255
7 2023 B 10 21 29 NA NA
8 2024 B 10 21 20 NA NA
9 2021 C 20 22 20 NA NA
10 2022 C 10 15 26 0.457 -0.143
11 2023 C 4 32 32 NA NA
12 2024 C 15 22 18 NA NA
Close, but not really repeatable...
Any ideas for predicting grass change for 2023 and 2024, down the rows? I prefer working in the tidyverse, though it may be possible to solve this more easily with nested for loops. Potential solutions include a rowwise data structure, or maybe nest_by(site), but I don't know how to then reference the grass.y1 column. Maybe the solution could be a rolling prediction with rollify, but I am not sure!
Thank you in advance for your help! Long time reader, first time asker!
So, let's go with a simpler example as a reprex to show how purrr::accumulate2() can work for you.
Let's set up a discrete time population model where a covariate also affects the population over time:
$N_t = 1.5N_{t-1} + C$
Simple! Heck, we can even use accumulate2 to simulate a population, and then add some noise.
library(tidyverse)
# ok, let's make a population from a simple discrete time growth model
# but, with a covariate!
covariate <- runif(5, 5, 10)
# use accumulate2 with the covariate to generate a population timeseries;
# in the three-argument lambda, ..1 is the accumulated population and ..3 is
# the covariate, giving N_t = 1.5 * N_{t-1} + C
pop <- accumulate2(1:5, covariate, ~ ..1 * 1.5 + ..3, .init = 0) %>% unlist()
pop <- pop[-1]
pop_obs <- rnorm(5, pop, 1) # add some noise
Great! Now, turn it into data and fit a model
# the data ####
dat <- tibble(
  time = 1:5,
  covariate = covariate,
  pop_obs = pop_obs,
  lag_pop = lag(pop_obs)
)
# the model ####
mod <- lm(pop_obs ~ covariate + lag_pop, data = dat)
# does this look reasonable?
coef(mod)
My coefficients looked reasonable, but set a seed and see for yourself!
Now we need some data to simulate over: new covariates, plus a column to seed the lagged population.
# now, simulation data ####
simdat <- tibble(
  time = 6:10,
  covariate = runif(5, 15, 20),
  lag_pop = dat$pop_obs[5] # the last observed value, seeding the lag
)
Great! To make this work, we'll need a function that takes the lagged value and the covariate as arguments and runs a prediction. Note that here the second argument is just a numeric, but you could pass an element of a list - a row of a data frame, if you will. That might be accomplished with some rowwise nesting or somesuch. For you to work out!
# OK, now we need to get predictions for pop at each step in time! ####
sim_pred <- function(lag_pop, covariate){
  newdat <- tibble(covariate = covariate,
                   lag_pop = lag_pop)
  predict(mod, newdata = newdat)
}
With this in hand, we can simulate forward, using lag_pop to generate a new population. Note that we need .init to seed the first prediction with the last observed value, and we then strip that leading value off the result.
# and let her rip!
# note, we have to init with the first value; for multiple covariates, make a
# rowwise list - each element of the list is one row of the data, and the
# sim_pred function takes it apart.
# in the lambda, ..1 is the previous prediction and ..3 is this step's covariate
simdat %>%
  mutate(pop = accumulate2(lag_pop,
                           covariate,
                           ~ sim_pred(..1, ..3),
                           .init = lag_pop[1]) %>%
           `[`(-1) %>% unlist())
That should do!
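To map this back onto the grass problem, here is a minimal, untested sketch (my adaptation, not part of the answer above). It assumes change.mod, ThePast, and TheDistantFuture exactly as defined in the question, seeds each site with its observed 2021 grass, and accumulates predictions down the rows:
library(tidyverse); library(lme4)
# predict one site's future rows, carrying last year's grass forward
predict_site <- function(df_site, site_id, grass_start) {
  grass <- accumulate(
    seq_len(nrow(df_site)),
    function(prev, i) {
      change <- predict(change.mod,
                        newdata = tibble(grass.y1 = prev,
                                         ChlA = df_site$ChlA[i],
                                         Sal  = df_site$Sal[i],
                                         Temp = df_site$Temp[i],
                                         site = site_id))
      prev + as.numeric(change)
    },
    .init = grass_start)[-1]  # drop the seeded starting value
  mutate(df_site, grass = grass,
         grass.change = grass - lag(grass, default = grass_start))
}
starts <- ThePast %>% filter(year == 2021) %>% select(site, grass)
PredictedFuture <- TheDistantFuture %>%
  arrange(site, year) %>%
  group_by(site) %>%
  group_modify(~ predict_site(.x, .y$site,
                              grass_start = starts$grass[starts$site == .y$site])) %>%
  ungroup()
group_modify() hands each site's rows to predict_site() along with that site's 2021 starting value, so each year's prediction feeds the next row, which is exactly the dependence the question describes.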

How to pivot wider in R on one column value

Below is the sample data and the manipulations that I have done so far. I have tried this in other ways, but I have an idea that may make this a bit simpler. The intended result is at the bottom. What I am looking for is a way to pivot wider based on when the smb column says Total. There are five possible values for smb: 1, 2, 3, 4, and Total. I want a new column, smb.total, which will have the total for each smb/year/qtr/area combination. I have tried putting a filter in front of the pivot_wider statement (at the bottom).
library(readxl)
library(dplyr)
library(stringr)
library(tidyverse)
library(gt)
employment <- c(1,45,125,130,165,260,600,601,2,46,127,132,167,265,601,602,50,61,110,121,170,305,55,603,52,66,112,123,172,310,604,605)
small <- c(1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA)
area <-c(001,001,001,001,001,001,001,001,001,001,001,001,001,001,001,001,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003)
year<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020)
qtr <-c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
smbtest <- data.frame(employment,small,area,year,qtr)
smbtest$smb <- 0
smbtest <- smbtest %>%
  mutate(smb = case_when(employment >= 0 & employment < 100 ~ "1",
                         employment >= 0 & employment < 150 ~ "2",
                         employment >= 0 & employment < 250 ~ "3",
                         employment >= 0 & employment < 500 ~ "4",
                         employment >= 0 & employment < 100000 ~ "Total"))
smbsummary2 <- smbtest %>%
  mutate(period = paste0(year, "q", qtr)) %>%
  group_by(area, period, smb) %>%
  summarise(employment = sum(employment), worksites = n(),
            .groups = 'drop_last') %>%
  mutate(employment = cumsum(employment),
         worksites = cumsum(worksites))
smbsummary2 <- smbsummary2 %>%
  group_by(area, smb) %>%
  mutate(empprevyear = lag(employment),
         empprevyearpp = employment - empprevyear,
         empprevyearpct = ((employment / empprevyear) - 1),
         empprevyearpct = scales::percent(empprevyearpct, accuracy = 0.01))
The intended result:
area period smb   employment worksites smb.Total
1    2020q1 1             46         2      1927
1    2020q1 2            301         4      1927
1    2020q1 3            466         5      1927
1    2020q1 4            726         6      1927
1    2020q1 Total       1927         8      1927
And here is the filter-then-pivot attempt:
smbsummary2 <- smbsummary2 %>%
  filter(smb == 'Total') %>%
  pivot_wider(names_from = smb, values_from = employment)
Maybe this code will solve your question:
employment <- c(1, 45, 125, 130, 165, 260, 600, 601, 2, 46, 127,
132, 167, 265, 601, 602, 50, 61, 110, 121, 170,
305, 55, 603, 52, 66, 112, 123, 172, 310, 604, 605)
small <- c(1, 1, 2, 2, 3, 4, NA, NA, 1, 1, 2, 2, 3, 4, NA, NA, 1, 1,
2, 2, 3, 4, NA, NA, 1, 1, 2, 2, 3, 4, NA, NA)
area <-c(001, 001, 001, 001, 001, 001, 001, 001, 001, 001, 001, 001,
001, 001, 001, 001, 003, 003, 003, 003, 003, 003, 003, 003,
003, 003, 003, 003, 003, 003, 003, 003)
year<-c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020)
qtr <-c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
smbtest <- tibble(employment, small, area, year, qtr)
smbtest$smb <- 0
smbtest <- smbtest %>%
  mutate(smb = case_when(employment >= 0 & employment < 100 ~ "1",
                         employment >= 0 & employment < 150 ~ "2",
                         employment >= 0 & employment < 250 ~ "3",
                         employment >= 0 & employment < 500 ~ "4",
                         employment >= 0 & employment < 100000 ~ "Total"))
smbtest <- smbtest %>%
  relocate(smb, year, qtr, area, small, employment)
smbsummary2 <- smbtest %>%
  mutate(period = paste0(year, "q", qtr)) %>%
  group_by(area, period, smb) %>%
  summarise(employment = sum(employment),
            worksites = n()) %>%
  mutate(employment = cumsum(employment),
         worksites = cumsum(worksites))
smbsummary2 %>%
  group_by(area, period) %>%
  mutate(`employ/period (%)` = employment / employment[smb == "Total"] * 100)
Probably not the best answer, but for your data I think it works well. If not, please tell me.
Good job!
I don't know if I understand correctly: you want smb.total of what, the employment variable?
If yes, for your object smbsummary2, use this code:
smbsummary2 <- smbtest %>%
  relocate(smb, year, qtr, area, small, employment) %>%
  group_by(smb, year, qtr, area) %>%
  mutate(smb.total = n())
If that was not it, could you explain it to me in more detail?
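Alternatively, if the goal is just the smb.Total column shown in the intended result, a minimal sketch (assuming smbsummary2 as built in the question, with dplyr loaded) copies each area/period group's Total employment onto every row of the group:
smbsummary2 %>%
  group_by(area, period) %>%
  mutate(smb.Total = employment[smb == "Total"]) %>%
  ungroup()
No pivot_wider() is needed here, because the Total value is repeated within each group rather than spread into new columns.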

glmmTMB with autocorrelation of irregular times

I'm putting together a glmmTMB model. I have data collected at a single site over the course of May, every year, for 4 years. Time resolution within a year can range from a few minutes (or even the same minute) to days apart.
The covariance vignette says that the ar1() structure requires a regular time series, but the ou(times + 0 | group) structure can handle irregular times. That said, it looks like the times argument is a factor; how does that work with an irregular time structure?
So, for example, is this a correct use of the ou() structure?
df <- structure(list(DayYear = c(234, 220, 234, 231, 243, 229, 228,
223, 220, 218, 234, 237, 234, 231, 241, 237, 241, 241, 233, 234,
234, 232, 218, 227, 232, 229, 220, 223, 228, 224), DateTime = structure(c(1495477980,
1399590540, 1495479780, 1495225920, 1464631980, 1495052760, 1463324460,
1494525780, 1494256560, 1494088440, 1495471320, 1495730940, 1495476960,
1495225200, 1432919940, 1495725900, 1432924200, 1432918860, 1495384020,
1495479900, 1463848140, 1495298820, 1399420080, 1463253000, 1463692920,
1495037040, 1494275160, 1494510780, 1463348220, 1494597180), class = c("POSIXct",
"POSIXt"), tzone = ""), Year = c(2017, 2014, 2017, 2017, 2016,
2017, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2015, 2017,
2015, 2015, 2017, 2017, 2016, 2017, 2014, 2016, 2016, 2017, 2017,
2017, 2016, 2017), N = c(2, 2, 7, 2, 6, 4, 1, 4, 1, 3, 1, 6,
2, 2, 2, 2, 5, 5, 3, 5, 3, 2, 4, 1, 6, 2, 2, 3, 5, 2)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
Create a sampling factor within year:
library(dplyr); library(glmmTMB)
df <- df %>%
  arrange(DateTime) %>%
  group_by(Year) %>%
  mutate(times = 1:n()) %>%
  ungroup() %>%
  mutate(YearF = as.factor(Year),
         times = numFactor(times))
mod1 <- glmmTMB(N ~ DayYear + YearF +
                  ou(times + 0 | YearF),
                family = nbinom2,
                data = df)
This particular model doesn't run too well because the toy dataset is so tiny (and probably doesn't show what I need to show), but is that a correct specification of the autocorrelation structure for an irregular time series?

No, it's not: you have to use decimal times/dates in numFactor. The way you've done it coerces the data set to be equally spaced. Below I use lubridate::decimal_date(DateTime) %% 1 to get the fraction-of-year variable that's used as the time coordinate.
library(dplyr)
library(lubridate)
library(glmmTMB)
df2 <- (df
  %>% arrange(DateTime)
  %>% group_by(Year)
  %>% mutate(times = lubridate::decimal_date(DateTime) %% 1)
  %>% ungroup()
)
df3 <- (df2
  %>% mutate(YearF = as.factor(Year),
             times = glmmTMB::numFactor(times))
  %>% select(N, DayYear, YearF, times)
)
mod1 <- glmmTMB(N ~ DayYear + YearF +
                  ou(times + 0 | YearF),
                family = nbinom2,
                data = df3)
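As a quick sanity check (my addition, not part of the original answer), glmmTMB can parse the numeric coordinates back out of the factor levels; they should come back as irregularly spaced fractions of a year rather than 1, 2, 3, ...:
# recover the time coordinates stored in the numFactor levels
head(glmmTMB::parseNumLevels(levels(df3$times)))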
