Running complex functions per row - r

I would like to use complex functions in a nested data frame.
My data looks like this:
Name Date
John 01.01.
Mark 03.09.
Edith 03.04.
Edith 08.08.
Mark 04.01.
Edith 01.03.
John 01.03.
John 01.04.
Mark 02.03.
Edith 04.05.
Edith 07.05.
Mark 04.02.
Edith 09.01.
John 01.09.
In a new column Day, I would like to know, for each name, the number of days between the Date in a given row and the earliest date for that person.
So that John will look like:
Day
0
..
2
2
..
9
I am experimenting with nest() and then running a function with modify, but I am very new to R. It doesn't work, and I don't really understand what the problem even is.
Thanks for help!

It is not clear from your sample data whether the dates are in %d.%m. or %m.%d. format, so adjust the format string in the code if needed.
library(tidyverse)
library(lubridate)
df <- read_table(
'name date
John 01.01.
Mark 03.09.
Edith 03.04.
Edith 08.08.
Mark 04.01.
Edith 01.03.
John 01.03.
John 01.04.
Mark 02.03.
Edith 04.05.
Edith 07.05.
Mark 04.02.
Edith 09.01.
John 01.09.')
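df <- df %>%
  # pass the format by name; the data has no year, so the current year is used
  mutate(date = as_date(date, format = "%d.%m.")) %>%
  group_by(name) %>%
  # days since the earliest date within each name
  mutate(diff_dates = date - min(date))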
Result:
> df
# A tibble: 14 x 3
name date diff_dates
<chr> <date> <drtn>
1 John 2019-01-01 0 days
2 Mark 2019-09-03 242 days
3 Edith 2019-04-03 84 days
4 Edith 2019-08-08 211 days
5 Mark 2019-01-04 0 days
6 Edith 2019-03-01 51 days
7 John 2019-03-01 59 days
8 John 2019-04-01 90 days
9 Mark 2019-03-02 57 days
10 Edith 2019-05-04 115 days
11 Edith 2019-05-07 118 days
12 Mark 2019-02-04 31 days
13 Edith 2019-01-09 0 days
14 John 2019-09-01 243 days

Using the dplyr package we get
library(dplyr)
data <- data %>%
mutate(Date = as.Date(Date, format = "%m.%d")) %>%
group_by(Name) %>%
mutate(early = min(Date)) %>%
mutate(Day = difftime(Date, early, units = "days"))
data
# # A tibble: 14 x 4
# # Groups: Name [3]
# Name Date early Day
# <fct> <date> <date> <time>
# 1 John 2019-01-01 2019-01-01 0 days
# 2 Mark 2019-03-09 2019-02-03 34 days
# 3 Edith 2019-03-04 2019-01-03 60 days
# 4 Edith 2019-08-08 2019-01-03 217 days
# 5 Mark 2019-04-01 2019-02-03 57 days
# 6 Edith 2019-01-03 2019-01-03 0 days
# 7 John 2019-01-03 2019-01-01 2 days
# 8 John 2019-01-04 2019-01-01 3 days
# 9 Mark 2019-02-03 2019-02-03 0 days
# 10 Edith 2019-04-05 2019-01-03 92 days
# 11 Edith 2019-07-05 2019-01-03 183 days
# 12 Mark 2019-04-02 2019-02-03 58 days
# 13 Edith 2019-09-01 2019-01-03 241 days
# 14 John 2019-01-09 2019-01-01 8 days
Edited as per Cole's recommendations.
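For comparison, here is a minimal base R sketch of the same idea that needs no extra packages. It assumes the dates are in day.month. format and that all of them fall in 2019; both are assumptions, since the sample data does not say. The column name day and the shortened sample are only for illustration.

```r
# Shortened sample of the question's data (illustrative subset)
df <- data.frame(
  name = c("John", "Mark", "Edith", "Edith", "Mark", "John"),
  date = c("01.01.", "03.09.", "03.04.", "08.08.", "04.01.", "01.03."),
  stringsAsFactors = FALSE
)

# Assumption: day.month. format, year 2019 appended so parsing is unambiguous
df$date <- as.Date(paste0(df$date, "2019"), format = "%d.%m.%Y")

# ave() applies min() within each name and returns a vector aligned with the rows
date_num <- as.numeric(df$date)
df$day <- as.integer(date_num - ave(date_num, df$name, FUN = min))

print(df)
```

ave() is used here because it returns one value per row, which is what a new column needs; a grouped mutate() does the same job in dplyr.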

Related

How can I create a day number variable in R based on dates?

I want to create a variable with the number of the day on which a participant took a survey (first day, second day, third day, etc.).
The issue is that there are participants that took the survey after midnight.
For example, this is what it looks like:
Id date
1 08/03/2020 08:17
1 08/03/2020 12:01
1 08/04/2020 15:08
1 08/04/2020 22:16
2 07/03/2020 08:10
2 07/03/2020 12:03
2 07/04/2020 15:07
2 07/05/2020 00:16
3 08/22/2020 09:17
3 08/23/2020 11:04
3 08/24/2020 00:01
4 10/03/2020 08:37
4 10/03/2020 11:13
4 10/04/2020 15:20
4 10/04/2020 23:05
This is what I want:
Id date day
1 08/03/2020 08:17 1
1 08/03/2020 12:01 1
1 08/04/2020 15:08 2
1 08/04/2020 22:16 2
2 07/03/2020 08:10 1
2 07/03/2020 12:03 1
2 07/04/2020 15:07 2
2 07/05/2020 00:16 2
3 08/22/2020 09:17 1
3 08/23/2020 11:04 2
3 08/24/2020 00:01 2
4 10/03/2020 08:37 1
4 10/03/2020 11:13 1
4 10/04/2020 15:20 2
4 10/04/2020 23:05 2
How can I create the day variable so that surveys taken shortly after midnight still count toward the previous day?
I tried the codes here, but I have issues with participants who took surveys after midnight.
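Please check the code below. It counts calendar days from each participant's first survey, so a survey taken after midnight (row 8 of the output) is still assigned to the following day.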
code
library(dplyr)
library(tidyr) # for fill()
data2 <- data %>%
mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
group_by(id) %>%
mutate(row = row_number(),
date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
fill(date3) %>%
ungroup() %>%
mutate(diff = as.numeric(date2 - date3 + 1)) %>%
select(-date2, -date3, -row)
output
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3
Here is one approach that explicitly shows the dates considered. First, make sure your date is in POSIXct format, as suggested in the comments (if not done already). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; otherwise, keep the date as it is. The time zone argument tz is set to "" (the current time zone) in every conversion so that the hour and the date are taken in the same zone. Finally, after grouping by Id, subtract the first survey_date from each survey_date and add 1 to get the day number, counting the first survey as day 1. You can use as.numeric to make this column numeric if desired.
Note: if you just want to count the distinct days on which the survey was taken (ignoring gaps between survey days), you can substitute the following for the last line:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This will increase day by 1 every new survey_date found for a given Id.
library(tidyverse)
library(lubridate)
df %>%
mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
mutate(survey_date = if_else(hour(date) < 2,
as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
group_by(Id) %>%
mutate(day = survey_date - first(survey_date) + 1)
Output
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
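The same cutoff rule can also be written by shifting every timestamp back by the cutoff before taking the date. Below is a minimal base R sketch of that variant, not the code from the answer above. The name cutoff_hours, the 2-hour value, the UTC time zone and the shortened sample are assumptions for illustration.

```r
# Shortened sample: participants 2 and 3 from the question
df <- data.frame(
  Id = c(2, 2, 2, 2, 3, 3, 3),
  date = c("07/03/2020 08:10", "07/03/2020 12:03", "07/04/2020 15:07",
           "07/05/2020 00:16", "08/22/2020 09:17", "08/23/2020 11:04",
           "08/24/2020 00:01"),
  stringsAsFactors = FALSE
)

cutoff_hours <- 2  # assumption: surveys before 02:00 belong to the previous day

# Parse in a fixed time zone so the result does not depend on the machine
stamp <- as.POSIXct(df$date, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Shifting back by the cutoff moves early-morning surveys onto the previous date
df$survey_date <- as.Date(stamp - cutoff_hours * 3600, tz = "UTC")

# Day number: days since the participant's earliest survey_date, starting at 1
sd_num <- as.numeric(df$survey_date)
df$day <- as.integer(sd_num - ave(sd_num, df$Id, FUN = min) + 1)

print(df)
```

Shifting the timestamp avoids a separate branch for early hours, and changing the cutoff only requires changing one number.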

Get daily average with R [duplicate]

I have a data.frame with several prices per day. I would like to get the average daily price in another column (avg_price). How can I do that?
date price avg_price
1 2017-01-01 01:00:00 10 18.75
2 2017-01-01 01:00:00 10 18.75
3 2017-01-01 05:00:00 25 18.75
4 2017-01-01 04:00:00 30 18.75
5 2017-01-02 08:00:00 10 20
6 2017-01-02 08:00:00 30 20
7 2017-01-02 07:00:00 20 20
library(lubridate)
library(tidyverse)
df %>%
group_by(day = day(date)) %>%
summarise(avg_price = mean(price))
# A tibble: 2 x 2
day avg_price
<int> <dbl>
1 1 18.8
2 2 20
df %>%
group_by(day = day(date)) %>%
mutate(avg_price = mean(price))
# A tibble: 7 x 4
# Groups: day [2]
date price avg_price day
<dttm> <dbl> <dbl> <int>
1 2017-01-01 01:00:00 10 18.8 1
2 2017-01-01 01:00:00 10 18.8 1
3 2017-01-01 05:00:00 25 18.8 1
4 2017-01-01 04:00:00 30 18.8 1
5 2017-01-02 08:00:00 10 20 2
6 2017-01-02 08:00:00 30 20 2
7 2017-01-02 07:00:00 20 20 2
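One caveat: lubridate's day() returns the day of the month, so data that spans several months would pool, say, January 1 and February 1 into one group. A minimal base R sketch that groups on the full calendar date instead is shown below; the sample prices and the UTC time zone are made up for illustration.

```r
# Made-up prices on the 1st of two different months
df <- data.frame(
  date = as.POSIXct(c("2017-01-01 01:00:00", "2017-01-01 05:00:00",
                      "2017-02-01 08:00:00", "2017-02-01 07:00:00"),
                    tz = "UTC"),
  price = c(10, 30, 20, 40)
)

# Full calendar date (year-month-day), not just the day of the month
df$day <- as.Date(df$date, tz = "UTC")

# ave() repeats each group's mean on every row of that group
df$avg_price <- ave(df$price, format(df$day), FUN = mean)

print(df)
```

With dplyr, the corresponding change would be to group by as_date(date) rather than day(date).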

How to delete rows in a dataframe that correspond to missing rows in another dataframe?

I have two dataframes with two columns each (Date and Data). The dataframes differ in length. What I want to do is to delete the rows in df1 whose Date does not appear in df2.
An example will clarify. These are my dataframes:
df1 = cbind(data.frame(Date = seq(as.Date("2018-11-1"), as.Date("2020-02-1"), by = "months"), stringsAsFactors = F), data.frame(Data = rnorm(16, 0, 1), stringsAsFactors = F))
Date Data
1 2018-11-01 1.09433662
2 2018-12-01 -0.27538189
3 2019-01-01 -0.19712728
4 2019-02-01 0.99852535
5 2019-03-01 -0.50760024
6 2019-04-01 -0.43127396
7 2019-05-01 0.90685965
8 2019-06-01 0.51510503
9 2019-07-01 -0.39070644
10 2019-08-01 1.27976428
11 2019-09-01 -0.63845519
12 2019-10-01 -0.05489751
13 2019-11-01 -0.87745923
14 2019-12-01 0.18082375
15 2020-01-01 0.08852416
16 2020-02-01 1.50827788
df2= cbind(data.frame(Date = df1$Date[c(1:5,7:9,11:13,15:16)]), data.frame(Data = c(1.09433662,-0.27538189, 0.99852535,-0.50760024,-0.43127396, 0.90685965,-0.39070644, 1.27976428,-0.63845519,-0.05489751,-0.87745923, 0.18082375, 1.50827788)))
Date Data
1 2018-11-01 1.09433662
2 2018-12-01 -0.27538189
3 2019-01-01 0.99852535
4 2019-02-01 -0.50760024
5 2019-03-01 -0.43127396
6 2019-05-01 0.90685965
7 2019-06-01 -0.39070644
8 2019-07-01 1.27976428
9 2019-09-01 -0.63845519
10 2019-10-01 -0.05489751
11 2019-11-01 -0.87745923
12 2020-01-01 0.18082375
13 2020-02-01 1.50827788
What I want now is that df1 is reduced to the same length as df2 by deleting the rows that are not in df2. The rows to be deleted correspond to the missing months in df2.
The result would be this for df1:
#df1 where the rows corresponding to the missing months in df2 have been deleted
Date Data
1 2018-11-01 1.09433662
2 2018-12-01 -0.27538189
3 2019-01-01 -0.19712728
4 2019-02-01 0.99852535
5 2019-03-01 -0.50760024
6 2019-05-01 0.90685965
7 2019-06-01 0.51510503
8 2019-07-01 -0.39070644
9 2019-09-01 -0.63845519
10 2019-10-01 -0.05489751
11 2019-11-01 -0.87745923
12 2020-01-01 0.08852416
13 2020-02-01 1.50827788
Can anyone help me?
Thanks a lot!
semi_join from dplyr does what you are looking for. Note that you copied the data from df2 as the output example.
library(dplyr)
semi_join(df1, df2, by = "Date")
Date Data
1 2018-11-01 0.38376758
2 2018-12-01 -0.28738352
3 2019-01-01 1.79556305
4 2019-02-01 -0.34680836
5 2019-03-01 0.57803280
6 2019-05-01 1.96801082
7 2019-06-01 0.38448708
8 2019-07-01 0.39829417
9 2019-09-01 0.94912096
10 2019-10-01 -0.04469681
11 2019-11-01 0.32008546
12 2020-01-01 1.09054839
13 2020-02-01 -1.45438502
and anti_join shows the records that should be removed.
anti_join(df1, df2, by = "Date")
Date Data
1 2019-04-01 2.1303783
2 2019-08-01 1.6907800
3 2019-12-01 -0.8593388
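If you prefer not to load dplyr, the same filtering can be sketched in base R with %in%. The small df1 and df2 below are illustrative stand-ins for the question's data, and the names kept and dropped are hypothetical.

```r
# Illustrative stand-ins: six monthly rows, two of them missing from df2
df1 <- data.frame(
  Date = seq(as.Date("2018-11-01"), as.Date("2019-04-01"), by = "month"),
  Data = c(1.1, -0.3, -0.2, 1.0, -0.5, -0.4)
)
df2 <- data.frame(Date = df1$Date[c(1:3, 5)], Data = c(9, 9, 9, 9))

# Rows of df1 whose Date also occurs in df2 (what semi_join returns)
kept <- df1[df1$Date %in% df2$Date, ]

# Rows of df1 whose Date is missing from df2 (what anti_join returns)
dropped <- df1[!(df1$Date %in% df2$Date), ]

print(kept)
print(dropped)
```

As with semi_join, only the columns of df1 are returned, and the Data values of df2 play no role in the match.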

How to iterate rows between start_date and end_date in R

I have a dataframe that looks like this:
And here is the output I'm hoping for.
This should work. The key is to use uncount from the tidyr package, which repeats each row a given number of times. After that, you need some date arithmetic to turn the repeated rows into consecutive months. Calculating the difference in months is tricky, and what I propose here only estimates it from the number of days, so it may not be the best way to do it, but you get the idea.
library(tidyverse)
library(lubridate)
df = tibble(name = c('Alice', 'Bob', 'Caroline'),
start_date = as.Date(c('2019-01-01','2018-03-01','2019-06-01')),
end_date = as.Date(c('2019-07-01','2019-05-01','2019-09-01')))
# # A tibble: 3 x 3
# name start_date end_date
# <chr> <date> <date>
# 1 Alice 2019-01-01 2019-07-01
# 2 Bob 2018-03-01 2019-05-01
# 3 Caroline 2019-06-01 2019-09-01
df %>% mutate(tenure_in_month = as.integer(difftime(end_date, start_date, units = "days")/365*12+2))%>%
uncount(tenure_in_month)%>%
group_by(name)%>%
mutate(iteratedDate = start_date %m+% months(row_number()-1))%>%
select(name,iteratedDate)
# A tibble: 28 x 2
# Groups: name [3]
name iteratedDate
<chr> <date>
1 Alice 2019-01-01
2 Alice 2019-02-01
3 Alice 2019-03-01
4 Alice 2019-04-01
5 Alice 2019-05-01
6 Alice 2019-06-01
7 Alice 2019-07-01
8 Bob 2018-03-01
9 Bob 2018-04-01
10 Bob 2018-05-01
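Because the month count above is estimated from days, it appears to give Bob 16 rows and Caroline 5, which runs one month past their end_date. A minimal base R sketch that lets seq() stop exactly at end_date is shown below. It assumes monthly steps are wanted and that start and end fall on the same day of the month; the names per_person and result are hypothetical.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Caroline"),
  start_date = as.Date(c("2019-01-01", "2018-03-01", "2019-06-01")),
  end_date = as.Date(c("2019-07-01", "2019-05-01", "2019-09-01")),
  stringsAsFactors = FALSE
)

# One small data frame per person: monthly dates from start_date to end_date
per_person <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(
    name = df$name[i],
    iteratedDate = seq(df$start_date[i], df$end_date[i], by = "month"),
    stringsAsFactors = FALSE
  )
})

# Stack the pieces into one long data frame
result <- do.call(rbind, per_person)

print(table(result$name))
```

Indexing by row number keeps the Date class intact, which is easy to lose when looping over the date values directly.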
I use the seq function to solve this problem.
library(data.table)
library(lubridate)
# data
original_data <- data.table(
CustomerName = c('Ben','Julie','Angelo','Carlo'),
StartDate = c(ymd(20190101),ymd(20180103),ymd(20190106),ymd(20170108)),
EndDate = c(ymd(20190107),ymd(20190105),ymd(20190109),ymd(20180112))
)
# CustomerName StartDate EndDate
#1: Ben 2019-01-01 2019-01-07
#2: Julie 2018-01-03 2019-01-05
#3: Angelo 2019-01-06 2019-01-09
#4: Carlo 2017-01-08 2018-01-12
# data.table's own [ ] syntax is enough here; %>% is not exported by data.table or lubridate
finish_data <- original_data[, .(IteratedDate = seq(from = StartDate,
                                                    to = EndDate, by = 'day')), by = .(CustomerName)]
# CustomerName IteratedDate
#1: Ben 2019-01-01
#2: Ben 2019-01-02
#3: Ben 2019-01-03
#4: Ben 2019-01-04
#5: Ben 2019-01-05
#6: Ben 2019-01-06
#7: Ben 2019-01-07
#8: Julie 2018-01-03
#9: Julie 2018-01-04

How to create a column based on two conditions from other data frame?

I'm trying to create a column that identifies if the row meets two conditions. For example, I have a table similar to this:
> dat <- data.frame(Date = c(rep(c("2019-01-01", "2019-02-01","2019-03-01", "2019-04-01"), 4)),
+ Rep = c(rep("Mike", 4), rep("Tasha", 4), rep("Dane", 4), rep("Trish", 4)),
+ Manager = c(rep("Amber", 2), rep("Michelle", 2), rep("Debbie", 4), rep("Brian", 4), rep("Tim", 3), "Trevor"),
+ Sales = floor(runif(16, min = 0, max = 10)))
> dat
Date Rep Manager Sales
1 2019-01-01 Mike Amber 6
2 2019-02-01 Mike Amber 3
3 2019-03-01 Mike Michelle 9
4 2019-04-01 Mike Michelle 2
5 2019-01-01 Tasha Debbie 9
6 2019-02-01 Tasha Debbie 6
7 2019-03-01 Tasha Debbie 0
8 2019-04-01 Tasha Debbie 4
9 2019-01-01 Dane Brian 3
10 2019-02-01 Dane Brian 6
11 2019-03-01 Dane Brian 6
12 2019-04-01 Dane Brian 1
13 2019-01-01 Trish Tim 6
14 2019-02-01 Trish Tim 7
15 2019-03-01 Trish Tim 6
16 2019-04-01 Trish Trevor 1
For the Reps that have switched manager, I would like to identify whether each row's manager is the first or the second manager with respect to the date. The ideal output would look something like:
Date Rep Manager Sales New_Column
1 2019-01-01 Mike Amber 6 1
2 2019-02-01 Mike Amber 3 1
3 2019-03-01 Mike Michelle 9 2
4 2019-04-01 Mike Michelle 2 2
5 2019-01-01 Trish Tim 6 1
6 2019-02-01 Trish Tim 7 1
7 2019-03-01 Trish Tim 6 1
8 2019-04-01 Trish Trevor 1 2
I have tried a few things but they're not quite working out. I have created two separate data frames where one consists of the first instance of that Rep and associated manager (df1) and the other one consists of the last instance of that rep and associated manager (df2). The code that I have tried that has gotten the closest is:
dat$New_Column <- ifelse(dat$Rep %in% df1$Rep & dat$Manager %in% df1$Manager, 1,
ifelse(dat$Rep %in% df2$Rep & dat$Manager %in% df2$Manager, 2, NA))
However this reads as two separate conditions, rather than having a condition of a condition (i.e. If Mike exists in the first instance and Amber exists in the first instance assign 1 rather than If Mike exists with the manager Amber in the first instance assign 1). Any help would be really appreciated. Thank you!
One option is to group by 'Rep', filter the rows where the number of unique 'Manager' values is 2, and then add a column by matching 'Manager' against the unique elements of 'Manager' to get the indices:
library(dplyr)
dat %>%
group_by(Rep) %>%
filter(n_distinct(Manager) == 2) %>%
mutate(New_Column = match(Manager, unique(Manager)))
# A tibble: 8 x 5
# Groups: Rep [2]
# Date Rep Manager Sales New_Column
# <chr> <chr> <chr> <int> <int>
#1 2019-01-01 Mike Amber 6 1
#2 2019-02-01 Mike Amber 3 1
#3 2019-03-01 Mike Michelle 9 2
#4 2019-04-01 Mike Michelle 2 2
#5 2019-01-01 Trish Tim 6 1
#6 2019-02-01 Trish Tim 7 1
#7 2019-03-01 Trish Tim 6 1
#8 2019-04-01 Trish Trevor 1 2
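The same group-wise logic can be sketched in base R with ave(), which may help to see what match(Manager, unique(Manager)) does inside each group. The shortened data and the names idx, n_managers, manager_no and switched are illustrative, and the rows are assumed to be sorted by date within each Rep.

```r
# Shortened sample, assumed sorted by date within each Rep
dat <- data.frame(
  Rep = c("Mike", "Mike", "Mike", "Tasha", "Tasha", "Trish", "Trish"),
  Manager = c("Amber", "Michelle", "Michelle", "Debbie", "Debbie", "Tim", "Trevor"),
  stringsAsFactors = FALSE
)

idx <- seq_len(nrow(dat))

# Number of distinct managers per Rep, repeated on every row of that Rep
n_managers <- ave(idx, dat$Rep, FUN = function(i) length(unique(dat$Manager[i])))

# Position of each row's manager in the Rep's order of first appearance
manager_no <- ave(idx, dat$Rep, FUN = function(i) {
  match(dat$Manager[i], unique(dat$Manager[i]))
})

# Keep only the Reps with exactly two managers
switched <- dat[n_managers == 2, ]
switched$New_Column <- manager_no[n_managers == 2]

print(switched)
```

Because the match is done within each Rep, a manager name is only ever compared with that Rep's own managers, which is the pairing the two separate %in% conditions in the question could not express.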

Resources