How do I delete the rows with April dates in the date2 column? Here is a small example, but my real dataset is much larger, so I need something that runs quickly.
Thanks!
data <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20"),
date2 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-06-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-05-03","2021-06-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-06-09","2021-05-09","2021-08-10","2021-06-10"),
DR01= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR02 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
We could use the month() function from lubridate and then filter:
library(dplyr)
library(lubridate)
data %>%
filter(month(date2)!=4)
Id date1 date2 DR01 DR02
1 1 2021-06-20 2021-07-01 4 9
2 1 2021-06-20 2021-07-01 5 5
3 1 2021-06-20 2021-07-01 6 4
4 1 2021-06-20 2021-07-01 7 3
5 1 2021-06-20 2021-06-02 7 1
6 1 2021-06-20 2021-05-03 3 3
7 1 2021-06-20 2021-06-03 4 4
8 1 2021-06-20 2021-06-09 2 2
9 1 2021-06-20 2021-05-09 6 6
10 1 2021-06-20 2021-08-10 4 4
11 1 2021-06-20 2021-06-10 3 3
Extract the month after converting to Date class and compare with !=:
data2 <- subset(data, format(as.Date(date2), '%m') != '04')
-output
data2
Id date1 date2 DR01 DR02
1 1 2021-06-20 2021-07-01 4 9
2 1 2021-06-20 2021-07-01 5 5
3 1 2021-06-20 2021-07-01 6 4
4 1 2021-06-20 2021-07-01 7 3
7 1 2021-06-20 2021-06-02 7 1
12 1 2021-06-20 2021-05-03 3 3
13 1 2021-06-20 2021-06-03 4 4
18 1 2021-06-20 2021-06-09 2 2
19 1 2021-06-20 2021-05-09 6 6
20 1 2021-06-20 2021-08-10 4 4
21 1 2021-06-20 2021-06-10 3 3
Another option without using any dates:
data[!grepl("-04-", data$date2), ]
We interpret date2 as a string and keep every row whose value does not contain "-04-". This returns
Id date1 date2 DR01 DR02
1 1 2021-06-20 2021-07-01 4 9
2 1 2021-06-20 2021-07-01 5 5
3 1 2021-06-20 2021-07-01 6 4
4 1 2021-06-20 2021-07-01 7 3
7 1 2021-06-20 2021-06-02 7 1
12 1 2021-06-20 2021-05-03 3 3
13 1 2021-06-20 2021-06-03 4 4
18 1 2021-06-20 2021-06-09 2 2
19 1 2021-06-20 2021-05-09 6 6
20 1 2021-06-20 2021-08-10 4 4
21 1 2021-06-20 2021-06-10 3 3
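As a quick sanity check, the string-based tests and the Date-based test should flag the same rows as long as date2 is stored as "YYYY-MM-DD" text. A minimal sketch on a small made-up vector (the name d and its values are illustrative, not taken from the question's data); the substr() variant is included as one more string-based option:

```r
# Illustrative dates in ISO "YYYY-MM-DD" form (made up for this sketch)
d <- c("2021-07-01", "2021-04-02", "2021-06-02", "2021-04-08", "2021-05-09")

by_format <- format(as.Date(d), "%m") != "04"   # parse, then extract the month
by_grepl  <- !grepl("-04-", d, fixed = TRUE)    # plain text search for "-04-"
by_substr <- substr(d, 6, 7) != "04"            # characters 6-7 hold the month

# All three logical vectors agree, so any of them can index the data frame
stopifnot(identical(by_format, by_grepl), identical(by_format, by_substr))
d[by_format]
```

The two string shortcuts depend on the ISO layout; with another date format they would need a different pattern or different positions.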
Related
I want to find weeks within months (separate numbering of weeks within each month) using the lubridate R package. My minimal working example is below:
library(tidyverse)
library(lubridate)
dt1 <-
tibble(
Date = seq(from = ymd("2021-01-01"), to = ymd("2021-12-31"), by = '1 day')
, Month = month(Date)
)
dt2 <-
dt1 %>%
group_by(Month) %>%
mutate(Week = week(Date))
dt2 %>%
print(n = 40)
# A tibble: 365 x 3
# Groups: Month [12]
Date Month Week
<date> <dbl> <dbl>
1 2021-01-01 1 1
2 2021-01-02 1 1
3 2021-01-03 1 1
4 2021-01-04 1 1
5 2021-01-05 1 1
6 2021-01-06 1 1
7 2021-01-07 1 1
8 2021-01-08 1 2
9 2021-01-09 1 2
10 2021-01-10 1 2
11 2021-01-11 1 2
12 2021-01-12 1 2
13 2021-01-13 1 2
14 2021-01-14 1 2
15 2021-01-15 1 3
16 2021-01-16 1 3
17 2021-01-17 1 3
18 2021-01-18 1 3
19 2021-01-19 1 3
20 2021-01-20 1 3
21 2021-01-21 1 3
22 2021-01-22 1 4
23 2021-01-23 1 4
24 2021-01-24 1 4
25 2021-01-25 1 4
26 2021-01-26 1 4
27 2021-01-27 1 4
28 2021-01-28 1 4
29 2021-01-29 1 5
30 2021-01-30 1 5
31 2021-01-31 1 5
32 2021-02-01 2 5
33 2021-02-02 2 5
34 2021-02-03 2 5
35 2021-02-04 2 5
36 2021-02-05 2 6
37 2021-02-06 2 6
38 2021-02-07 2 6
39 2021-02-08 2 6
40 2021-02-09 2 6
# ... with 325 more rows
I wonder what I am missing here. For row number 31 in the output (31 2021-01-31 1 5), the value in the Week column should be 1. Any lead on how to get the desired output?
It's not completely clear how you are defining a week. If Week 1 starts on the first day of a month, then you can do:
dt2 <- dt1 %>% mutate(Week = 1L + ((day(Date) - 1L) %/% 7L))
dt2 %>% slice(21:40) %>% print(n = 20L)
# A tibble: 20 × 3
Date Month Week
<date> <dbl> <int>
1 2021-01-21 1 3
2 2021-01-22 1 4
3 2021-01-23 1 4
4 2021-01-24 1 4
5 2021-01-25 1 4
6 2021-01-26 1 4
7 2021-01-27 1 4
8 2021-01-28 1 4
9 2021-01-29 1 5
10 2021-01-30 1 5
11 2021-01-31 1 5
12 2021-02-01 2 1
13 2021-02-02 2 1
14 2021-02-03 2 1
15 2021-02-04 2 1
16 2021-02-05 2 1
17 2021-02-06 2 1
18 2021-02-07 2 1
19 2021-02-08 2 2
20 2021-02-09 2 2
With base R, you could simply do:
dt1$Week <- 1L + ((as.POSIXlt(dt1$Date)$mday - 1L) %/% 7L)
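If weeks should instead follow the calendar, so that a new week starts every Monday, the count has to be shifted by the weekday of the first day of the month. This is only a sketch under that assumed definition; week_of_month is a hypothetical helper, not a lubridate function:

```r
# Hypothetical helper: calendar week within the month, weeks starting on Monday
week_of_month <- function(date) {
  lt <- as.POSIXlt(date)
  first <- as.POSIXlt(as.Date(format(date, "%Y-%m-01")))
  # POSIXlt$wday runs 0 (Sunday) .. 6 (Saturday); shift to 0 (Monday) .. 6 (Sunday)
  offset <- (first$wday + 6L) %% 7L
  1L + (lt$mday - 1L + offset) %/% 7L
}

res <- week_of_month(as.Date(c("2021-01-01", "2021-01-03", "2021-01-04",
                               "2021-01-31", "2021-02-01")))
res
```

Under this definition 2021-01-01 (a Friday) and 2021-01-03 fall in week 1, 2021-01-04 (a Monday) opens week 2, and 2021-02-01 starts again at week 1.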
Could you help me solve the problem below? As you can see, in the second part of the code I exclude the DR columns whose values are all equal to 0. However, in the third part of the code I need to select from D1 to the last DR column so that the sums can be computed, and this gives an error. How can I fix it?
library(dplyr)
df1 <- structure(
list(date1 = c("2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28",
"2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28"),
date2 = c("2021-04-02","2021-04-02","2021-04-03","2021-04-08","2021-04-09","2021-04-10","2021-07-01","2021-07-02","2021-07-03","2021-07-03"),
Week= c("Friday","Friday","Saturday","Thursday","Friday","Saturday","Thursday","Friday","Saturday","Saturday"),
D1 = c(2,3,4,4,6,3,4,5,6,2), DR01 = c(4,1,4,3,3,4,3,6,3,2), DR02= c(4,2,6,7,3,2,7,4,4,3),DR03= c(9,5,4,3,3,2,1,5,4,3),
DR04 = c(5,4,3,3,3,6,2,1,9,2),DR05 = c(5,4,5,3,6,2,1,9,3,4),
DR06 = c(2,4,4,3,3,5,6,7,8,3),DR07 = c(2,5,4,4,9,4,7,8,3,3),
DR08 = c(0,0,0,0,1,2,0,0,0,0),DR09 = c(0,0,0,0,0,0,0,0,0,0),DR010 = c(0,0,0,0,0,0,0,0,0,0),DR011 = c(0,4,0,0,0,0,0,0,0,0), DR012 = c(0,0,0,0,0,0,0,0,0,0)),
class = "data.frame", row.names = c(NA, -10L))
df1<-df1 %>%
select(!where(~ is.numeric(.) && all(. == 0)))
df1<-df1 %>%
group_by(date1,date2, Week) %>%
select(D1:DR012) %>%
summarise_all(sum)
We can do the select before the group_by:
library(dplyr)
df1 %>%
select(date1, date2, Week, matches("^D")) %>%
group_by(date1, date2, Week) %>%
summarise(across(everything(), sum), .groups = 'drop')
-output
# A tibble: 8 × 13
date1 date2 Week D1 DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR011
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-06-28 2021-04-02 Friday 5 5 6 14 9 9 6 7 0 4
2 2021-06-28 2021-04-03 Saturday 4 4 6 4 3 5 4 4 0 0
3 2021-06-28 2021-04-08 Thursday 4 3 7 3 3 3 3 4 0 0
4 2021-06-28 2021-04-09 Friday 6 3 3 3 3 6 3 9 1 0
5 2021-06-28 2021-04-10 Saturday 3 4 2 2 6 2 5 4 2 0
6 2021-06-28 2021-07-01 Thursday 4 3 7 1 2 1 6 7 0 0
7 2021-06-28 2021-07-02 Friday 5 6 4 5 1 9 7 8 0 0
8 2021-06-28 2021-07-03 Saturday 8 5 7 7 11 7 11 6 0 0
After the first select, it is not clear why we have to select again. It is not really needed, since summarise with across(everything(), ...) applies to all columns other than the grouping columns:
df1 %>%
select(!where(~ is.numeric(.) && all(. == 0))) %>%
group_by(across(date1:Week)) %>%
summarise(across(everything(), sum), .groups = 'drop')
# A tibble: 8 × 13
date1 date2 Week D1 DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR011
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-06-28 2021-04-02 Friday 5 5 6 14 9 9 6 7 0 4
2 2021-06-28 2021-04-03 Saturday 4 4 6 4 3 5 4 4 0 0
3 2021-06-28 2021-04-08 Thursday 4 3 7 3 3 3 3 4 0 0
4 2021-06-28 2021-04-09 Friday 6 3 3 3 3 6 3 9 1 0
5 2021-06-28 2021-04-10 Saturday 3 4 2 2 6 2 5 4 2 0
6 2021-06-28 2021-07-01 Thursday 4 3 7 1 2 1 6 7 0 0
7 2021-06-28 2021-07-02 Friday 5 6 4 5 1 9 7 8 0 0
8 2021-06-28 2021-07-03 Saturday 8 5 7 7 11 7 11 6 0 0
We could use summarise with across:
library(dplyr)
df1 %>%
select(!where(~ is.numeric(.) && all(. == 0))) %>%
group_by(date1,date2, Week) %>%
summarise(across(where(is.numeric), sum))
# A tibble: 8 × 13
# Groups: date1, date2 [8]
date1 date2 Week D1 DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR011
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-06-28 2021-04-02 Friday 5 5 6 14 9 9 6 7 0 4
2 2021-06-28 2021-04-03 Saturday 4 4 6 4 3 5 4 4 0 0
3 2021-06-28 2021-04-08 Thursday 4 3 7 3 3 3 3 4 0 0
4 2021-06-28 2021-04-09 Friday 6 3 3 3 3 6 3 9 1 0
5 2021-06-28 2021-04-10 Saturday 3 4 2 2 6 2 5 4 2 0
6 2021-06-28 2021-07-01 Thursday 4 3 7 1 2 1 6 7 0 0
7 2021-06-28 2021-07-02 Friday 5 6 4 5 1 9 7 8 0 0
8 2021-06-28 2021-07-03 Saturday 8 5 7 7 11 7 11 6 0 0
DR012 was removed by the earlier select, so it no longer exists to be selected:
df1 %>%
select(!where(~ is.numeric(.) && all(. == 0))) %>%
names()
[1] "date1" "date2" "Week" "D1" "DR01" "DR02" "DR03" "DR04" "DR05"
[10] "DR06" "DR07" "DR08" "DR011"
Change your code to
df1 %>%
group_by(date1,date2, Week) %>%
select(D1:DR011) %>%
summarise_all(sum)
or
df1 %>%
group_by(date1,date2, Week) %>%
select(starts_with("D")) %>%
summarise_all(sum)
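For comparison, a base R sketch of the same idea: drop the all-zero numeric columns first, then sum everything that is left within each group, so no column range has to be spelled out. The small data frame toy is made up for illustration and is not the df1 from the question:

```r
# Made-up miniature of the problem: DR02 is zero everywhere
toy <- data.frame(
  date2 = c("2021-04-02", "2021-04-02", "2021-04-03"),
  Week  = c("Friday", "Friday", "Saturday"),
  D1    = c(2, 3, 4),
  DR01  = c(4, 1, 4),
  DR02  = c(0, 0, 0)
)

# TRUE for columns that are numeric and contain only zeros
all_zero <- vapply(toy, function(x) is.numeric(x) && all(x == 0), logical(1))
toy_kept <- toy[!all_zero]

# "." stands for every column not named on the right-hand side
res <- aggregate(. ~ date2 + Week, data = toy_kept, FUN = sum)
res
```

Because the columns to sum are whatever remains after the drop, the code keeps working when a different set of DR columns happens to be all zero.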
The code below generates a scatter plot with three horizontal lines, which mark the mean, mean + standard deviation and mean - standard deviation. All the dates in my data are currently used to calculate these three values.
However, I would like to exclude the month of April when calculating the mean and standard deviation. How could I do that?
Executable code below:
library(dplyr)
library(tidyr)
library(lubridate)
data <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20"),
date2 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-04-03","2021-04-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-07-09","2021-07-09","2021-07-10","2021-07-10"),
Week= c("Thursday","Thursday","Thursday","Thursday","Friday","Friday","Friday","Friday",
"Friday","Friday","Saturday","Saturday","Saturday","Saturday","Saturday","Thursday",
"Thursday","Friday","Friday","Saturday","Saturday"),
DTPE = c("Ho","Ho","Ho","Ho","","","","","","","","","","","","","","","","Ho","Ho"),
D1 = c(8,1,9, 3,5,4,7,6,3,8,2,3,4,6,7,8,4,2,6,2,3), DR01 = c(4,1,4,3,3,4,3,6,3,7,2,3,4,6,7,8,4,2,6,7,3),
DR02 = c(8,1,4,3,3,4,1,6,3,7,2,3,4,6,7,8,4,2,6,2,3), DR03 = c(7,5,4,3,3,4,1,5,3,3,2,3,4,6,7,8,4,2,6,4,3),
DR04= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR05 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
graph <- function(dt, dta = data) {
dim_data<-dim(data)
day<-c(seq.Date(from = as.Date(data$date2[1]), by = "days",
length = dim_data[1]
))
data_grouped <- data %>%
mutate(across(starts_with("date"), as.Date)) %>%
group_by(date2) %>%
summarise(Id = first(Id),
date1 = first(date1),
Week = first(Week),
DTPE = first(DTPE),
D1 = sum(D1)) %>%
select(Id,date1,date2,Week,DTPE,D1)
data_grouped %>%
mutate(DTPE = na_if(DTPE, ""))
df_OC<-subset(data_grouped, DTPE == "")
ds_CO = df_OC %>% filter(weekdays(date2) %in% weekdays(as.Date(dt)))
mean<-mean(ds_CO$D1)
sd<-sd(ds_CO$D1)
dta %>%
filter(date2 == ymd(dt)) %>%
summarize(across(starts_with("DR"), sum)) %>%
pivot_longer(everything(), names_pattern = "DR(.+)", values_to = "val") %>%
mutate(name = as.numeric(name)) %>%
plot(xlab = "Days", ylab = "Number", xlim = c(0, 45),cex=1.5,cex.lab=1.5,
cex.axis=1.5, cex.main=2, cex.sub=2, lwd=2.5, ylim = c((min(.$val) %/% 10) * 15, (max(.$val) %/% 10 + 1) * 100))
abline(h=mean, col='blue')
abline(h=(mean + sd), col='green',lty=2)
abline(h=(mean - sd), col='orange',lty=2)
}
graph("2021-07-10",data)
data %>%
filter("04" != format(as.Date(date2), format = "%m"))
# Id date1 date2 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
# 1 1 2021-06-20 2021-07-01 Thursday Ho 8 4 8 7 4 9
# 2 1 2021-06-20 2021-07-01 Thursday Ho 1 1 1 5 5 5
# 3 1 2021-06-20 2021-07-01 Thursday Ho 9 4 4 4 6 4
# 4 1 2021-06-20 2021-07-01 Thursday Ho 3 3 3 3 7 3
# 5 1 2021-06-20 2021-07-09 Friday 2 2 2 2 2 2
# 6 1 2021-06-20 2021-07-09 Friday 6 6 6 6 6 6
# 7 1 2021-06-20 2021-07-10 Saturday Ho 2 7 2 4 4 4
# 8 1 2021-06-20 2021-07-10 Saturday Ho 3 3 3 3 3 3
I recommend you permanently make date1 and date2 proper Date objects in the frame instead of converting them every time you do something. While the conversion is relatively inexpensive, it's also unnecessary, and the consequence of forgetting it might be subtle differences in the results (e.g., the column being treated as a categorical variable rather than a continuous/discrete-ordinal one).
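A minimal sketch of that one-time conversion in base R, assuming both columns hold "YYYY-MM-DD" text as in the question (toy is a made-up stand-in for data):

```r
toy <- data.frame(
  date1 = c("2021-06-20", "2021-06-20", "2021-06-20"),
  date2 = c("2021-07-01", "2021-04-02", "2021-07-09"),
  D1    = c(8, 5, 2)
)

# Convert once; afterwards every step can rely on Date semantics
date_cols <- c("date1", "date2")
toy[date_cols] <- lapply(toy[date_cols], as.Date)

# No as.Date() needed inside the filter any more
no_april <- toy[format(toy$date2, "%m") != "04", ]
no_april
```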
You already use lubridate, so you could apply its month() function:
data %>%
filter(month(date2) != 4)
Id date1 date2 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-06-20 2021-07-01 Thursday Ho 8 4 8 7 4 9
2 1 2021-06-20 2021-07-01 Thursday Ho 1 1 1 5 5 5
3 1 2021-06-20 2021-07-01 Thursday Ho 9 4 4 4 6 4
4 1 2021-06-20 2021-07-01 Thursday Ho 3 3 3 3 7 3
5 1 2021-06-20 2021-07-09 Friday 2 2 2 2 2 2
6 1 2021-06-20 2021-07-09 Friday 6 6 6 6 6 6
7 1 2021-06-20 2021-07-10 Saturday Ho 2 7 2 4 4 4
8 1 2021-06-20 2021-07-10 Saturday Ho 3 3 3 3 3 3
Using substr
subset(data, substr(date2, 6, 7 ) != '04')
-output
Id date1 date2 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-06-20 2021-07-01 Thursday Ho 8 4 8 7 4 9
2 1 2021-06-20 2021-07-01 Thursday Ho 1 1 1 5 5 5
3 1 2021-06-20 2021-07-01 Thursday Ho 9 4 4 4 6 4
4 1 2021-06-20 2021-07-01 Thursday Ho 3 3 3 3 7 3
18 1 2021-06-20 2021-07-09 Friday 2 2 2 2 2 2
19 1 2021-06-20 2021-07-09 Friday 6 6 6 6 6 6
20 1 2021-06-20 2021-07-10 Saturday Ho 2 7 2 4 4 4
21 1 2021-06-20 2021-07-10 Saturday Ho 3 3 3 3 3 3
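Filtering the rows is only half of what the question asks for; the mean and standard deviation still have to be computed from the filtered values. A sketch under the assumption that only the statistics (not the plotted points) should ignore April. The vectors below are made up; in the question's graph() function the same condition would presumably be applied to the rows that feed mean() and sd():

```r
# Made-up daily totals; two of the five dates fall in April
date2 <- as.Date(c("2021-04-02", "2021-04-03", "2021-07-01", "2021-07-09", "2021-07-10"))
D1    <- c(33, 22, 21, 8, 5)

keep <- format(date2, "%m") != "04"   # TRUE for every non-April date

m <- mean(D1[keep])   # mean without April
s <- sd(D1[keep])     # standard deviation without April
c(mean = m, lower = m - s, upper = m + s)
```

If no rows remain after the filter, mean() returns NaN and sd() returns NA, so it is worth checking sum(keep) before drawing the lines.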
I would like to create a new data frame from the df I entered below. My idea is to have only one row per day. For example, instead of 4 rows for 01/07/2021 there will be only 1, with the column values for that day summed.
df <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-04-03","2021-04-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-04-07","2021-04-09","2021-04-10","2021-04-10"),
Week= c("Thursday","Thursday","Thursday","Thursday","Friday","Friday","Friday","Friday",
"Friday","Friday","Saturday","Saturday","Saturday","Saturday","Saturday","Thursday",
"Thursday","Friday","Friday","Saturday","Saturday"),
DTPE = c("Ho","Ho","Ho","Ho","","","","","","","","","","","","","","","","Ho","Ho"),
D1 = c(8,1,9, 3,5,4,7,6,3,8,2,3,4,6,7,8,4,2,6,2,3), DR01 = c(4,1,4,3,3,4,3,6,3,7,2,3,4,6,7,8,4,2,6,7,3),
DR02 = c(8,1,4,3,3,4,1,6,3,7,2,3,4,6,7,8,4,2,6,2,3), DR03 = c(7,5,4,3,3,4,1,5,3,3,2,3,4,6,7,8,4,2,6,4,3),
DR04= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR05 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
> df
Id date1 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-07-01 Thursday Ho 8 4 8 7 4 9
2 1 2021-07-01 Thursday Ho 1 1 1 5 5 5
3 1 2021-07-01 Thursday Ho 9 4 4 4 6 4
4 1 2021-07-01 Thursday Ho 3 3 3 3 7 3
5 1 2021-04-02 Friday 5 3 3 3 3 3
6 1 2021-04-02 Friday 4 4 4 4 2 2
7 1 2021-04-02 Friday 7 3 1 1 7 1
8 1 2021-04-02 Friday 6 6 6 5 4 5
9 1 2021-04-02 Friday 3 3 3 3 2 3
10 1 2021-04-02 Friday 8 7 7 3 1 7
11 1 2021-04-03 Saturday 2 2 2 2 2 2
12 1 2021-04-03 Saturday 3 3 3 3 3 3
13 1 2021-04-03 Saturday 4 4 4 4 4 4
14 1 2021-04-03 Saturday 6 6 6 6 6 7
15 1 2021-04-03 Saturday 7 7 7 7 7 7
16 1 2021-04-08 Thursday 8 8 8 8 8 8
17 1 2021-04-08 Thursday 4 4 4 4 4 4
18 1 2021-04-07 Friday 2 2 2 2 2 2
19 1 2021-04-09 Friday 6 6 6 6 6 6
20 1 2021-04-10 Saturday Ho 2 7 2 4 4 4
21 1 2021-04-10 Saturday Ho 3 3 3 3 3 3
We may group by 'Id', 'date1' and 'Week', then summarise the numeric columns with sum inside across:
library(dplyr)
df %>% group_by(Id, date1, Week) %>%
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
You can perform this using the following code:
library(dplyr)
df %>%
group_by(Id, date1, Week) %>%
select(D1:DR05) %>%
summarise_all(sum)
# A tibble: 7 × 9
# Groups: Id, date1 [7]
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-03 Saturday 22 22 22 22 22 23
3 1 2021-04-07 Friday 2 2 2 2 2 2
4 1 2021-04-08 Thursday 12 12 12 12 12 12
5 1 2021-04-09 Friday 6 6 6 6 6 6
6 1 2021-04-10 Saturday 5 10 5 7 7 7
7 1 2021-07-01 Thursday 21 12 16 19 22 21
You might also want to convert the date1 field to a Date object; you can do that with a lubridate function such as ymd() inside a mutate().
Base R with aggregate:
aggregate(cbind(D1, DR01, DR02, DR03, DR04, DR05) ~ Id+date1+Week, df, sum)
Output:
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-07 Friday 2 2 2 2 2 2
3 1 2021-04-09 Friday 6 6 6 6 6 6
4 1 2021-04-03 Saturday 22 22 22 22 22 23
5 1 2021-04-10 Saturday 5 10 5 7 7 7
6 1 2021-04-08 Thursday 12 12 12 12 12 12
7 1 2021-07-01 Thursday 21 12 16 19 22 21
I'm looking for a way to calculate the number of days a participant (id) spent in a study.
An example data file looks like this:
data <- data.frame(date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
"2020-12-04", "2020-12-05", "2020-12-08",
"2020-11-22", "2020-11-21", "2020-11-24",
"2020-11-25", "2020-11-30", "2020-11-29",
"2021-01-29", "2021-01-20", "2021-01-30",
"2021-02-01", "2021-02-04", "2021-02-04")),
id = rep(1:3, each = 6))
data <- dplyr::arrange(data, id, date)
data
date id
1 2020-11-29 1
2 2020-11-30 1
3 2020-12-02 1
4 2020-12-04 1
5 2020-12-05 1
6 2020-12-08 1
7 2020-11-21 2
8 2020-11-22 2
9 2020-11-24 2
10 2020-11-25 2
11 2020-11-29 2
12 2020-11-30 2
13 2021-01-20 3
14 2021-01-29 3
15 2021-01-30 3
16 2021-02-01 3
17 2021-02-04 3
18 2021-02-04 3
What I'd like to have is a new column days_from_start that takes the first day for every id and sets it to 0, then computes the number of days elapsed for every other row within each id. Something like this:
data$days_from_start <- c(0, 1, 3, 4, 5, 8,
0, 1, 3, 4, 8, 10,
0, 9, 10, 11, 14, 14)
data
date id days_from_start
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 4
5 2020-12-05 1 5
6 2020-12-08 1 8
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 10
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 11
17 2021-02-04 3 14
18 2021-02-04 3 14
Any ideas?
Thank you
Simply group the data, work out the earliest date for each id and then calculate differences.
library(dplyr)
data <- dplyr::arrange(data, id, date)
data %>%
group_by(id) %>%
mutate(
start_date=min(date),
days_from_start=as.numeric(date-start_date)
) %>%
ungroup() %>%
select(-start_date)
# A tibble: 18 x 3
date id days_from_start
<date> <int> <dbl>
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 5
5 2020-12-05 1 6
6 2020-12-08 1 9
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 9
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 12
17 2021-02-04 3 15
18 2021-02-04 3 15
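The same calculation can be done without dplyr. A base R sketch using ave(), which applies a function within each id and returns a vector as long as its input (toy is a made-up subset of the question's data):

```r
toy <- data.frame(
  date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
                   "2020-11-21", "2020-11-22", "2020-11-24")),
  id   = rep(1:2, each = 3)
)

# Dates are stored as days since 1970-01-01, so plain subtraction gives day counts
days <- as.numeric(toy$date)
toy$days_from_start <- days - ave(days, toy$id, FUN = min)
toy
```

Since min() is taken per id, the rows do not need to be sorted beforehand.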