Count of events per month for a line plot in R

I have a dataset where every row corresponds to a participant. It has a categorical variable called "Injury.Cause":
Injury.Date.Time Injury.Cause
3608 2019-05-22 00:00:00 Motor Vehicle
3915 2019-03-25 10:00:00 Accidental
3916 2019-03-25 16:00:00 Burn
3917 2019-03-25 10:00:00 Accidental
3920 2019-03-25 00:00:00 Fall
3928 2019-03-27 00:00:00 Fall
3929 2019-03-26 21:50:00 Motor Vehicle
3930 2019-03-27 17:00:00 Fall
3931 2019-03-26 00:00:00 Motor Vehicleter
I want to make a line plot with multiple lines (each line representing a cause of injury over time), where the y-axis shows the total number of occurrences (frequency) per month for each cause of injury.
I assume the first step is that I have to arrange my data as follows:
Date Motor Vehicle Accidental Burn Fall
2021-03-22 3 2 1 2
2021-03-23 1 1 0 3
This example is shown in days, but I believe I can control the time frame when making the plot. I will be exploring changes across 3-month intervals on the x-axis.
Thank you in advance
Rami

We can get the data in the required structure using table() after extracting the date from the timestamp:
table(as.Date(df$Injury.Date.Time), df$Injury.Cause)
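If you want that contingency table as a plain data frame (matching the wide layout sketched in the question), one optional extra step, assuming base R only, could be:
# convert the table() result into a data frame with one column per cause and the dates as row names
as.data.frame.matrix(table(as.Date(df$Injury.Date.Time), df$Injury.Cause))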
Or in the tidyverse:
library(tidyverse)
df %>%
  count(Date = as.Date(Injury.Date.Time), Injury.Cause) %>%
  pivot_wider(names_from = Injury.Cause, values_from = n, values_fill = 0)
#  Date       Accidental  Burn  Fall `Motor Vehicle` `Motor Vehicleter`
#  <date>          <int> <int> <int>           <int>              <int>
#1 2019-03-25          2     1     1               0                  0
#2 2019-03-26          0     0     0               1                  1
#3 2019-03-27          0     0     2               0                  0
#4 2019-05-22          0     0     0               1                  0
However, if you want to plot the data, you should keep it in long format rather than wide format:
df %>%
  count(Date = as.Date(Injury.Date.Time), Injury.Cause) %>%
  ggplot() + aes(Date, n, color = Injury.Cause) + geom_line()
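Since the goal is counts per month (with 3-month breaks on the x-axis), a minimal sketch of that aggregation, assuming lubridate is available for floor_date(), might be:
library(tidyverse)
library(lubridate)

df %>%
  # collapse each timestamp to the first day of its month, then count per cause
  count(Month = floor_date(as.Date(Injury.Date.Time), "month"), Injury.Cause) %>%
  ggplot(aes(Month, n, color = Injury.Cause)) +
  geom_line() +
  # label the x-axis every 3 months
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y")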


R create week numbers with specified start date

This seems like it should be straightforward, but I cannot find a way to do it.
I have a sales cycle that begins around August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. So far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
The solution needs to let me specify the start date and iterate week numbers as 7 days from the start date. The actual start date doesn't always fall on a Sunday or Monday.
Example data frame:
eg_data <- data.frame(
  cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
  dates = as.POSIXct(c("2019-08-01", "2019-08-10", "2018-07-31", "2018-08-16", "2017-08-03", "2017-08-14", "2016-08-05", "2016-08-29")),
  week_n = c("1", "2", "1", "3", "1", "2", "1", "4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
library(dplyr)
secs_per_week <- 60*60*24*7
(eg_data
  %>% group_by(cycle)
  %>% mutate(nw = 1 + as.numeric(round((dates - min(dates)) / secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
  cycle     dates               week_n    nw
  <chr>     <dttm>              <chr>  <dbl>
5 cycle2017 2017-08-03 00:00:00 1          1
6 cycle2017 2017-08-14 00:00:00 2          3
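As for the parenthetical above about a smoother unit conversion: one option, assuming base R's difftime() is acceptable, is to ask for the difference in weeks directly (this reproduces the same rounding behaviour, including the 2017 mismatch):
library(dplyr)

eg_data %>%
  group_by(cycle) %>%
  # difftime() with units = "weeks" avoids the seconds-per-week constant
  mutate(nw = 1 + round(as.numeric(difftime(dates, min(dates), units = "weeks"))))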
If someone has a better answer please post it, but this works.
Take the dataframe in the example, eg_data:
eg_data %>%
  group_by(cycle) %>%
  mutate(
    cycle_start = as.Date(min(dates)),
    days_diff = as.Date(dates) - cycle_start,
    week_n = days_diff / 7,
    week_n_whole = ceiling(days_diff / 7)) -> eg_data_check
(First time I've answered my own question)
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(lubridate)
library(stringr)  # str_sub() comes from stringr
eg_data %>%
  mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
         week_n2 = ceiling((dates - aug1) / ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
  cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
  start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
  left_join(cycle_starts) %>%
  mutate(week_n2 = ceiling((dates - start_date) / ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(lubridate)
eg_data %>%
  group_by(cycle) %>%
  mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4
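An equivalent way to express the same idea, assuming lubridate's interval arithmetic (integer division of an interval by a period), would be:
library(dplyr)
library(lubridate)

eg_data %>%
  group_by(cycle) %>%
  # whole weeks elapsed since the first date of the cycle, plus one
  mutate(new_week = interval(min(dates), dates) %/% weeks(1) + 1)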

How to reassign a changed cell value back into a dataframe in R with the tidyverse

I am doing some work with data on building energy consumption. The data are fairly large (about 20 million rows), and I am using the train and building_metadata dataframes, joined into one, from the ASHRAE energy prediction competition on Kaggle here.
I have created a new variable, energy_sqm, and found, for example, an outage in a certain building that I want to correct for. The code works fine for finding and mutating the variables I want to fix.
But I cannot figure out how to re-insert the result into the original dataframe, since the tidyverse does not do in-place substitution of values. So I need to somehow put the new values back into the df, but <- does not work.
The code that works:
joined %>%
  filter(building_id == 778, meter == 0, timestamp == ymd_hms("2016-08-24 15:00:00")) %>%
  mutate(
    meter_reading = joined %>%
      filter(building_id == 778, meter == 0, timestamp == ymd_hms("2016-08-23 15:00:00")) %>%
      pull(meter_reading),
    energy_sqm = joined %>%
      filter(building_id == 778, meter == 0, timestamp == ymd_hms("2016-08-23 15:00:00")) %>%
      pull(energy_sqm))
and the output is:
# A tibble: 1 x 11
building_id meter timestamp meter_reading site_id primary_use square_feet year_built
<dbl> <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl>
1 778 0 2016-08-24 14:00:00 328. 6 Entertainm~ 108339 NA
# ... with 3 more variables: floor_count <dbl>, square_meter <dbl>, energy_sqm <dbl>
which is what I want, with the changes included.
If I try to reinsert it in the original position, which is
joined %>%
  filter(building_id == 778, meter == 0, timestamp == ymd_hms("2016-08-24 15:00:00"))
using the <- operation, it does not work.
So how can I reassign such values back into the dataframe? It is not a modification of the whole df, which I could easily just reassign. It is basically a single filtered row in which I am changing the values of two variables (meter_reading and energy_sqm) and need to reinsert.
To clarify, the basic dataframe looks like this:
A tibble: 8,775 x 11
building_id meter timestamp meter_reading site_id
<dbl> <dbl> <dttm> <dbl> <dbl>
1 778 0 2016-01-01 00:00:00 172. 6
2 778 0 2016-01-01 01:00:00 171. 6
3 778 0 2016-01-01 02:00:00 171. 6
4 778 0 2016-01-01 03:00:00 171 6
5 778 0 2016-01-01 04:00:00 171. 6
6 778 0 2016-01-01 05:00:00 170. 6
7 778 0 2016-01-01 06:00:00 174. 6
8 778 0 2016-01-01 07:00:00 174. 6
9 778 0 2016-01-01 08:00:00 175. 6
10 778 0 2016-01-01 09:00:00 174. 6
# ... with 8,765 more rows, and 6 more variables: primary_use <chr>,
# square_feet <dbl>, year_built <dbl>, floor_count <dbl>,
# square_meter <dbl>, energy_sqm <dbl>
and the variables I want to change are:
A tibble: 8,775 x 4
building_id meter timestamp energy_sqm
<dbl> <dbl> <dttm> <dbl>
1 778 0 2016-01-01 00:00:00 0.0171
2 778 0 2016-01-01 01:00:00 0.0170
3 778 0 2016-01-01 02:00:00 0.0170
4 778 0 2016-01-01 03:00:00 0.0170
5 778 0 2016-01-01 04:00:00 0.0170
6 778 0 2016-01-01 05:00:00 0.0169
7 778 0 2016-01-01 06:00:00 0.0172
8 778 0 2016-01-01 07:00:00 0.0173
9 778 0 2016-01-01 08:00:00 0.0174
10 778 0 2016-01-01 09:00:00 0.0173
# ... with 8,765 more rows
dplyr 1.0.0 has a new function rows_update() for this purpose. If you assume your original dataset is basic_df and your changes are stored in changes_df you can use:
library(dplyr)
rows_update(basic_df, changes_df, by = c("building_id", "meter", "timestamp"))
To assign this back to your original data:
basic_df <- rows_update(basic_df, changes_df,
                        by = c("building_id", "meter", "timestamp"))
@Wil Thank you very much for your suggestion and time. I hadn't heard of that update. It is a lot shorter than the one I am posting; however, I could not make rows_update() work. It kept throwing the error "Error: Attempting to update missing rows.". rows_upsert() with the same configuration, however, worked perfectly, adding another row from the single-row dataframe I created with my code above.
Now, the solution that worked was a twist on the familiar
df %>% mutate(variable_name = replace(variable_name, variable_name == something, replace_value)).
However, in order not to flatten the dataframe, I put the search inside the mutate(), like below:
joined <- joined %>%
  mutate(meter_reading = replace(meter_reading,
    # having already found where the problematic value is, I am targeting it
    meter_reading == joined %>%
      filter(building_id == 778,
             meter == 0,
             timestamp == ymd_hms("2016-08-24 14:00:00")) %>%
      pull(meter_reading),
    # and replacing the meter_reading value with the value from the previous day
    joined %>%
      filter(building_id == 778,
             meter == 0,
             timestamp == ymd_hms("2016-08-23 14:00:00")) %>%
      pull(meter_reading)))
The search part and/or the replacement part can be defined beforehand to make the code shorter. This way the df gets updated, and you can reassign it to the original one with no problem.
I hope someone finds this useful.
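Building on the remark about defining the pieces beforehand, here is a sketch (assuming the same column names) that keys on the row identifiers rather than on the old meter_reading value itself, which avoids accidentally matching other rows that happen to share that reading:
library(dplyr)
library(lubridate)

# look up the replacement value from the previous day
replacement <- joined %>%
  filter(building_id == 778, meter == 0,
         timestamp == ymd_hms("2016-08-23 14:00:00")) %>%
  pull(meter_reading)

# overwrite only the row identified by building, meter and timestamp
joined <- joined %>%
  mutate(meter_reading = if_else(
    building_id == 778 & meter == 0 &
      timestamp == ymd_hms("2016-08-24 14:00:00"),
    replacement,
    meter_reading))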

dplyr grouping and using a conditional from multiple columns

I have a data frame like this
transactionId user_id total_in_pennies created_at X yearmonth
1 345068 8 9900 2018-09-13 New Customer 2018-09-01
2 346189 8 9900 2018-09-20 Repeat Customer 2018-09-01
3 363500 8 7700 2018-10-11 Repeat Customer 2018-10-01
4 376089 8 7700 2018-10-25 Repeat Customer 2018-10-01
5 198450 11 0 2018-01-18 New Customer 2018-01-01
6 203966 11 0 2018-01-25 Repeat Customer 2018-01-01
it has many more rows, but that little snippet can be used.
I am trying to group using dplyr so I can get a final data frame like this
I use this code
df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers = sum(X == "New Customer"),
            Repeat_Customers = sum(X == "Repeat Customer"),
            New_Customers_sales = sum(total_in_pennies & X == "New Customers"),
            Repeat_Customers_sales = sum(total_in_pennies & X == "Repeat Customers"))
and I get this result
> head(df_RFM11)
# A tibble: 6 x 5
yearmonth New_Customers Repeat_Customers New_Customers_sales Repeat_Customers_sales
<date> <int> <int> <int> <int>
1 2018-01-01 4880 2428 0 0
2 2018-02-01 2027 12068 0 0
3 2018-03-01 1902 15296 0 0
4 2018-04-01 1921 13363 0 0
5 2018-05-01 2631 18336 0 0
6 2018-06-01 2339 14492 0 0
and I am able to get the first 2 columns I need, the counts of new and repeat customers, but I get 0's when I try to get the sum of total_in_pennies for new customers and repeat customers.
Any help on what I am doing wrong?
You'd need to subset with brackets (and note that the labels are "New Customer"/"Repeat Customer", without the trailing s), like below:
df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers = sum(X == "New Customer"),
            Repeat_Customers = sum(X == "Repeat Customer"),
            New_Customers_sales = sum(total_in_pennies[X == "New Customer"]),
            Repeat_Customers_sales = sum(total_in_pennies[X == "Repeat Customer"])
  )
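For what it's worth, the original expression returned zero because X == "New Customers" (with the trailing s) never matches, and even with the right label, total_in_pennies & ... coerces the pennies to TRUE/FALSE instead of summing them. A tiny illustration with made-up values:
total_in_pennies <- c(9900, 7700, 0)
X <- c("New Customer", "Repeat Customer", "New Customer")

sum(total_in_pennies & X == "New Customer")  # 1    - counts rows where both are non-zero/TRUE
sum(total_in_pennies[X == "New Customer"])   # 9900 - sums pennies for the matching rows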

Break up rows representing long time intervals into multiple rows

I have a dataframe (tibble) with multiple rows, each row contains an IDNR, a start date, an end date and an exposure status. The IDNR is a character variable, the start and end date are date variables and the exposure status is a numerical variable. This is what the top 3 rows look like:
# A tibble: 48,266 x 4
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-07-01 0
2 2 2017-10-30 2018-07-01 0
3 3 2016-02-11 2016-12-03 1
# ... with 48,256 more rows
In order to do a time-varying Cox regression, I want to split the rows into 90-day parts while maintaining the start and end dates. Here is an example of what I would like to achieve. What happens is that the new end date becomes start + 90 days, and a new row is created whose start date is the end date of the previous row. If the time between start and end is now less than 90 days, that is fine (as for IDNR 1 and 3); however, for IDNR 2 the time still exceeds 90 days, so a third row needs to be added.
# A tibble: 48,266 x 4
# Groups: IDNR [33,240]
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-08-09 1
7 3 2016-08-09 2016-12-03 1
I'm relatively new to coding in R, but I've found dplyr to be very useful so far. So, if someone knows a solution using dplyr I would really appreciate that.
Thanks in advance!
Here you go:
Using df as your data frame:
df = data.frame(IDNR = 1:3,
                start = c("2018-02-15", "2017-10-30", "2016-02-11"),
                end = c("2018-07-01", "2018-07-01", "2016-12-03"),
                exposure = c(0, 0, 1))
Do:
library(lubridate)
newDF = apply(df, 1, function(x){
  newStart = seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)
  newEnd = c(seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)[-1], ymd(x["end"]))
  d = data.frame(IDNR = rep(x["IDNR"], length(newStart)),
                 start = newStart,
                 end = newEnd,
                 exposure = rep(x["exposure"], length(newStart)))
})
newDF = do.call(rbind, newDF)
newDF = newDF[newDF$start != newDF$end,]
Result:
> newDF
IDNR start end exposure
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-05-11 1
7 3 2016-05-11 2016-08-09 1
8 3 2016-08-09 2016-11-07 1
9 3 2016-11-07 2016-12-03 1
What this does is create a sequence of dates from start to end in steps of 90 days and build a smaller data frame from them, along with the IDNR and exposure. The apply() returns a list of data frames that you can bind together using do.call(rbind, ...). The last line removes rows that have the same start and end date.
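Since the question asked for a dplyr-flavoured solution, here is a rough sketch of the same idea using reframe() (dplyr >= 1.1.0), assuming one row per IDNR as in the example; it is illustrative rather than a drop-in replacement:
library(dplyr)
library(lubridate)

df %>%
  mutate(start = ymd(start), end = ymd(end)) %>%
  group_by(IDNR, exposure) %>%
  # breakpoints every 90 days, always ending on the original end date
  reframe(brk = unique(c(seq(start, end, by = 90), end))) %>%
  group_by(IDNR) %>%
  mutate(start = brk, end = lead(brk)) %>%
  filter(!is.na(end)) %>%
  ungroup() %>%
  select(IDNR, start, end, exposure)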

How can I use mutate to create a new column based only on a subset of other rows of a data frame?

I was agonizing over how to phrase my question. I have a data frame of accounts and I want to create a new column that is a flag for whether there is another account that has a duplicate email within 30 days of that account.
I have a table like this.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John#gmail.com","John#gmail.com","Alex#outlook.com","Alan#yahoo.com","Stan#aol.com","Mary#outlook.com","Adam#outlook.com","Tom#aol.com","Jane#yahoo.com","John#gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(AccountNumbers,EmailAddress,Dates)
print(df)
AccountNumbers EmailAddress Dates
3748 John#gmail.com 2018-05-01
8894 John#gmail.com 2018-05-05
9923 Alex#outlook.com 2018-05-10
4502 Alan#yahoo.com 2018-05-15
7283 Stan#aol.com 2018-05-20
8012 Mary#outlook.com 2018-05-25
2938 Adam#outlook.com 2018-05-30
7485 Tom#aol.com 2018-06-01
1010 Jane#yahoo.com 2018-06-05
9877 John#gmail.com 2018-06-10
John#gmail.com appears three times, I want to flag the first two rows because they both appear within 30 days of each other, but I don't want to flag the third.
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 John#gmail.com 2018-05-01 1
8894 John#gmail.com 2018-05-05 1
9923 Alex#outlook.com 2018-05-10 0
4502 Alan#yahoo.com 2018-05-15 0
7283 Stan#aol.com 2018-05-20 0
8012 Mary#outlook.com 2018-05-25 0
2938 Adam#outlook.com 2018-05-30 0
7485 Tom#aol.com 2018-06-01 0
1010 Jane#yahoo.com 2018-06-05 0
9877 John#gmail.com 2018-06-10 0
I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.
Edit: To clarify, I want to look at the 30 days around each account. So that if I had a scenario where the same email address was being added exactly every 30 days, all of the occurrences of that email should be flagged.
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John#gmail.com","John#gmail.com","Alex#outlook.com","Alan#yahoo.com","Stan#aol.com","Mary#outlook.com","Adam#outlook.com","Tom#aol.com","Jane#yahoo.com","John#gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NAs (corresponding to cases with only one entry) with 0. Finally, I ungroup.
library(dplyr)
library(tidyr)

df %>%
  group_by(email) %>%
  mutate(dupe = coalesce(date - lag(date) < 30, (date - lead(date) < 30))) %>%
  mutate(dupe = replace_na(dupe, 0)) %>%
  ungroup
This gives,
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 John#gmail.com 2018-05-01 1
# 2 8894 John#gmail.com 2018-05-05 1
# 3 9923 Alex#outlook.com 2018-05-10 0
# 4 4502 Alan#yahoo.com 2018-05-15 0
# 5 7283 Stan#aol.com 2018-05-20 0
# 6 8012 Mary#outlook.com 2018-05-25 0
# 7 2938 Adam#outlook.com 2018-05-30 0
# 8 7485 Tom#aol.com 2018-06-01 0
# 9 1010 Jane#yahoo.com 2018-06-05 0
# 10 9877 John#gmail.com 2018-06-10 0
as required.
Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.
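For instance, a minimal way to do that sorting step first (assuming the column names used above):
# sort within each email by date before computing lags/leads
df <- df %>% arrange(email, date)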
I think this gets at what you want. The helper column uses a cumulative sum to start a new group whenever the gap from the previous entry for the same email exceeds 30 days, and any group containing at least two rows gets flagged:
df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = 'days') <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 2) * 1) %>%
  ungroup() %>%
  select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 John#gmail.com 2018-05-01 1
2 8894 John#gmail.com 2018-05-05 1
3 9923 Alex#outlook.com 2018-05-10 0
4 4502 Alan#yahoo.com 2018-05-15 0
5 7283 Stan#aol.com 2018-05-20 0
6 8012 Mary#outlook.com 2018-05-25 0
7 2938 Adam#outlook.com 2018-05-30 0
8 7485 Tom#aol.com 2018-06-01 0
9 1010 Jane#yahoo.com 2018-06-05 0
10 9877 John#gmail.com 2018-06-10 0
Note:
I think @Lyngbakr's solution is better for the circumstances in your question. Mine would be more appropriate if the size of the duplicate group might change (e.g., you want to check for 3 or 4 entries within 30 days of each other, rather than 2).
Slightly modified data:
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John#gmail.com","John#gmail.com","Alex#outlook.com","Alan#yahoo.com","Stan#aol.com","Mary#outlook.com","Adam#outlook.com","Tom#aol.com","Jane#yahoo.com","John#gmail.com")
Dates <- as.Date(c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10"))
df <- data.frame(AccountNumbers,EmailAddress,Dates, stringsAsFactors = FALSE)
