In a data frame that I've read into R, I'm trying to change some of the dates to different dates. For example, I want 2020-06-04 to become 2020-06-03.
Below are three attempts I've made, none of which worked.
Before any of them, I converted the Date column like this:
AbsoluteCover$Date <- as.Date(AbsoluteCover$Date,
format = "%m/%d/%y")
1:
AC <- mutate(AbsoluteCover, NewDate = c("2020-06-04" == "2020-06-03" & "2020-06-19" == "2020-06-18" & "2020-07-12" == "2020-07-28"))
This just creates a new column called "NewDate" but with all FALSE in the cells. This outcome makes sense, but it's not what I want.
2:
AC <- AbsoluteCover %>% mutate(Date, "2020-06-04" == "2020-06-03" & "2020-06-19" == "2020-06-18" & "2020-07-12" == "2020-07-28")
This does the same thing as 1 above.
3:
AC <- replace(AbsoluteCover$Date, c("2020-06-04", "2020-06-19", "2020-07-12"), c("2020-06-03", "2020-06-18", "2020-07-28"))
This just returns a data frame with one column with dates.
Here is an example of my data frame:
dput(head(AbsoluteCover))
structure(list(Plot = c("A1", "A1", "A1", "A2", "A2", "A2"),
Date = structure(c(18417, 18432, 18455, 18417, 18432, 18455
), class = "Date"), Cover = c(12L, 34L, 17L, 2L, 50L, 3L)), row.names = c(NA,
-6L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
Plot = c("A1", "A2"), .rows = list(1:3, 4:6)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
As you haven't provided a sample dataframe, I have worked out the example using a test dataset.
You can use the which function to select the rows that meet a condition and then assign the new date to them:
dates = c(as.Date('2020-06-04'), as.Date('2020-01-03'))
df = data.frame('a' = sample(dates, 15, replace= TRUE))
df
#> a
#> 1 2020-01-03
#> 2 2020-06-04
#> 3 2020-06-04
#> 4 2020-01-03
#> 5 2020-06-04
#> 6 2020-06-04
#> 7 2020-01-03
#> 8 2020-06-04
#> 9 2020-01-03
#> 10 2020-01-03
#> 11 2020-01-03
#> 12 2020-01-03
#> 13 2020-01-03
#> 14 2020-06-04
#> 15 2020-06-04
df[which(df$a == as.Date('2020-06-04')), 'a'] = as.Date('2020-06-03')
df
#> a
#> 1 2020-01-03
#> 2 2020-06-03
#> 3 2020-06-03
#> 4 2020-01-03
#> 5 2020-06-03
#> 6 2020-06-03
#> 7 2020-01-03
#> 8 2020-06-03
#> 9 2020-01-03
#> 10 2020-01-03
#> 11 2020-01-03
#> 12 2020-01-03
#> 13 2020-01-03
#> 14 2020-06-03
#> 15 2020-06-03
Created on 2020-07-09 by the reprex package (v0.3.0)
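If several dates need replacing at once, as in the question, the same idea can be extended with a lookup table. This is only a minimal base R sketch: the names old_dates and new_dates and the small df are made up for illustration, and the fourth date is there just to show that unmatched values are left alone.

```r
# Hypothetical sample data: three dates to correct plus one to keep as-is.
df <- data.frame(
  Plot = c("A1", "A1", "A1", "A2"),
  Date = as.Date(c("2020-06-04", "2020-06-19", "2020-07-12", "2020-08-01"))
)

# Lookup table: old_dates[i] should become new_dates[i].
old_dates <- as.Date(c("2020-06-04", "2020-06-19", "2020-07-12"))
new_dates <- as.Date(c("2020-06-03", "2020-06-18", "2020-07-28"))

# match() gives the position of each date in old_dates, or NA if absent.
idx <- match(df$Date, old_dates)
hit <- !is.na(idx)

# Overwrite only the rows that were found in the lookup table.
df$Date[hit] <- new_dates[idx[hit]]
df
```

Because the replacement values are already Date objects, the column keeps its Date class.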
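You can use mutate and case_when. Wrapping the replacement values in as.Date() keeps the column as a Date instead of character, and the final TRUE ~ Date leaves every other date unchanged rather than turning it into NA:
library(dplyr)
df %>% mutate(Date = case_when(
  Date == as.Date("2020-06-04") ~ as.Date("2020-06-03"),
  Date == as.Date("2020-06-19") ~ as.Date("2020-06-18"),
  Date == as.Date("2020-07-12") ~ as.Date("2020-07-28"),
  TRUE ~ Date))
# A tibble: 6 x 3
# Groups: Plot [2]
Plot Date Cover
<chr> <date> <int>
1 A1 2020-06-03 12
2 A1 2020-06-18 34
3 A1 2020-07-28 17
4 A2 2020-06-03 2
5 A2 2020-06-18 50
6 A2 2020-07-28 3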
I have an odd situation: when I use dplyr::rowwise() and min inside mutate, it outputs a single value across all rows rather than one value per row. It works with my other data frames in the same session, and I'm not sure what the issue is. I have also restarted RStudio.
df <- indf %>%
  dplyr::rowwise() %>%
  mutate(test = min(as.Date(date1), as.Date(date2), na.rm = TRUE))
structure(list(id = structure(c("5001", "3002", "2001", "1001",
"6001", "9001"), label = "Subject name or identifier", format.sas = "$"),
date1 = structure(c(NA, 18599, NA, NA, NA, NA), class = "Date"),
date2 = structure(c(18472, 18597, 18638, 18675, 18678, 18696
), class = "Date"), test = structure(c(18472, 18472, 18472,
18472, 18472, 18472), class = "Date")), class = c("rowwise_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame")))
This could be the result of loading the plyr package after dplyr, which masks dplyr's mutate:
library(dplyr)
indf %>%
rowwise %>%
plyr::mutate(test = min(date1, date2, na.rm = TRUE))
# A tibble: 6 × 4
# Rowwise:
id date1 date2 test
<chr> <date> <date> <date>
1 5001 NA 2020-07-29 2020-07-29
2 3002 2020-12-03 2020-12-01 2020-07-29
3 2001 NA 2021-01-11 2020-07-29
4 1001 NA 2021-02-17 2020-07-29
5 6001 NA 2021-02-20 2020-07-29
6 9001 NA 2021-03-10 2020-07-29
versus using :: to call the function from dplyr explicitly:
indf %>%
rowwise %>%
dplyr::mutate(test = min(date1, date2, na.rm = TRUE))
# A tibble: 6 × 4
# Rowwise:
id date1 date2 test
<chr> <date> <date> <date>
1 5001 NA 2020-07-29 2020-07-29
2 3002 2020-12-03 2020-12-01 2020-12-01
3 2001 NA 2021-01-11 2021-01-11
4 1001 NA 2021-02-17 2021-02-17
5 6001 NA 2021-02-20 2021-02-20
6 9001 NA 2021-03-10 2021-03-10
Note that rowwise is slow; it may be better to use the vectorized pmin:
indf %>%
ungroup %>%
dplyr::mutate(test = pmin(date1, date2, na.rm = TRUE))
# A tibble: 6 × 4
id date1 date2 test
<chr> <date> <date> <date>
1 5001 NA 2020-07-29 2020-07-29
2 3002 2020-12-03 2020-12-01 2020-12-01
3 2001 NA 2021-01-11 2021-01-11
4 1001 NA 2021-02-17 2021-02-17
5 6001 NA 2021-02-20 2021-02-20
6 9001 NA 2021-03-10 2021-03-10
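The difference between min and pmin can be seen without any packages. The sketch below rebuilds the two date columns from the question as plain vectors (the variable names overall_min and row_min are only illustrative): min collapses everything to a single value, which is what a non-rowwise mutate recycles down the column, while pmin compares element by element.

```r
# The two date columns from the question, as plain Date vectors.
date1 <- as.Date(c(NA, "2020-12-03", NA, NA, NA, NA))
date2 <- as.Date(c("2020-07-29", "2020-12-01", "2021-01-11",
                   "2021-02-17", "2021-02-20", "2021-03-10"))

# min() pools all values and returns one date for the whole data set.
overall_min <- min(date1, date2, na.rm = TRUE)

# pmin() is the "parallel" minimum: one result per position.
row_min <- pmin(date1, date2, na.rm = TRUE)

overall_min
row_min
```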
I've seen various solutions for this question based on date only, but the time component is tripping me up. I have two data frames with POSIXct columns called 'datetime'. For DF1 that column has data rounded to the nearest hour. For DF2, the time component is not rounded to the nearest hour and can occur anytime. The dataframes look like this:
DF1
datetime             X   Y   Z
2020-09-01 03:00:00  1   3   4
2020-09-02 12:00:00  12  3   5
2020-09-02 22:00:00  4   9   19
2020-09-03 01:00:00  4   10  2
2020-09-04 06:00:00  4   12  1
2020-09-04 08:00:00  11  13  10
DF2
datetime             Var
2020-09-01 02:23:14  A
2020-09-01 03:12:09  B
2020-09-02 11:52:15  A
2020-09-02 12:15:44  B
2020-09-02 22:31:56  A
2020-09-02 21:38:05  B
2020-09-03 01:11:39  A
2020-09-03 00:59:33  B
2020-09-04 05:12:19  A
2020-09-04 06:07:09  B
2020-09-04 08:22:28  A
2020-09-04 07:50:17  B
What I want is to merge these two dataframes based on this column using the date and time that are closest in time to 'datetime' in DF1, so that it looks like this:
datetime             X   Y   Z   Var
2020-09-01 03:00:00  1   3   4   B
2020-09-02 12:00:00  12  3   5   A
2020-09-02 22:00:00  4   9   19  B
2020-09-03 01:00:00  4   10  2   B
2020-09-04 06:00:00  4   12  1   B
2020-09-04 08:00:00  11  13  10  B
Thank you!
Add helper columns for the merge and the group_by, merge on the calendar date, and then use dplyr to keep the closest match within each group:
library(dplyr)
df1$tmp <- as.Date(df1$datetime)
df2$tmp <- as.Date(df2$datetime)
df1$grp <- 1:(nrow(df1))
merge(df1, df2, "tmp") %>%
group_by(grp) %>%
slice(which.min(abs(difftime(datetime.x, datetime.y)))) %>%
ungroup() %>%
select(-c(tmp,grp,datetime.y))
# A tibble: 6 × 5
datetime.x X Y Z Var
<chr> <int> <int> <int> <chr>
1 2020-09-01 03:00:00 1 3 4 B
2 2020-09-02 12:00:00 12 3 5 A
3 2020-09-02 22:00:00 4 9 19 B
4 2020-09-03 01:00:00 4 10 2 B
5 2020-09-04 06:00:00 4 12 1 B
6 2020-09-04 08:00:00 11 13 10 B
Data
df1 <- structure(list(datetime = c("2020-09-01 03:00:00", "2020-09-02 12:00:00",
"2020-09-02 22:00:00", "2020-09-03 01:00:00", "2020-09-04 06:00:00",
"2020-09-04 08:00:00"), X = c(1L, 12L, 4L, 4L, 4L, 11L), Y = c(3L,
3L, 9L, 10L, 12L, 13L), Z = c(4L, 5L, 19L, 2L, 1L, 10L)), class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(datetime = c("2020-09-01 02:23:14", "2020-09-01 03:12:09",
"2020-09-02 11:52:15", "2020-09-02 12:15:44", "2020-09-02 22:31:56",
"2020-09-02 21:38:05", "2020-09-03 01:11:39", "2020-09-03 00:59:33",
"2020-09-04 05:12:19", "2020-09-04 06:07:09", "2020-09-04 08:22:28",
"2020-09-04 07:50:17"), Var = c("A", "B", "A", "B", "A", "B",
"A", "B", "A", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-12L))
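One thing to be aware of: merging on the calendar date only compares timestamps from the same day, so a nearest match that falls just across midnight would be missed. A possible alternative, sketched here in base R under the assumption that the data are small enough to compare every pair, is to search all of df2 for each row of df1. The helper names t1, t2, nearest and result are made up for this example, and only the X column of df1 is reproduced to keep it short.

```r
# Same timestamps as the Data section, trimmed to the columns needed here.
df1 <- data.frame(
  datetime = c("2020-09-01 03:00:00", "2020-09-02 12:00:00", "2020-09-02 22:00:00",
               "2020-09-03 01:00:00", "2020-09-04 06:00:00", "2020-09-04 08:00:00"),
  X = c(1L, 12L, 4L, 4L, 4L, 11L)
)
df2 <- data.frame(
  datetime = c("2020-09-01 02:23:14", "2020-09-01 03:12:09", "2020-09-02 11:52:15",
               "2020-09-02 12:15:44", "2020-09-02 22:31:56", "2020-09-02 21:38:05",
               "2020-09-03 01:11:39", "2020-09-03 00:59:33", "2020-09-04 05:12:19",
               "2020-09-04 06:07:09", "2020-09-04 08:22:28", "2020-09-04 07:50:17"),
  Var = rep(c("A", "B"), 6)
)

# Parse once; the time zone is fixed to UTC so the result is reproducible.
t1 <- as.POSIXct(df1$datetime, tz = "UTC")
t2 <- as.POSIXct(df2$datetime, tz = "UTC")

# For each df1 timestamp, find the row of df2 with the smallest absolute gap.
nearest <- vapply(seq_along(t1), function(i) {
  which.min(abs(as.numeric(difftime(t2, t1[i], units = "secs"))))
}, integer(1))

result <- cbind(df1, Var = df2$Var[nearest], matched = df2$datetime[nearest])
result
```

This compares every pair of rows, so for large tables a rolling join would scale better; the sketch is only meant to show the logic.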
I have two datasets, one with values at specific time points for different IDs and another one with several time frames for the IDs. Now I want to check if the timepoint in dataframe one is within any of the time frames from dataset 2 matching the ID.
For example:
df1:
ID date time
1 2020-04-14 11:00:00
1 2020-04-14 18:00:00
1 2020-04-15 10:00:00
1 2020-04-15 20:00:00
1 2020-04-16 11:00:00
1 ...
2 ...
df2:
ID start end
1 2020-04-14 16:00:00 2020-04-14 20:00:00
1 2020-04-15 18:00:00 2020-04-16 13:00:00
2 ...
2
What I want:
df1_new:
ID date time mark
1 2020-04-14 11:00:00 0
1 2020-04-14 18:00:00 1
1 2020-04-15 10:00:00 0
1 2020-04-15 20:00:00 1
1 2020-04-16 11:00:00 1
1 ...
2 ...
Any help would be appreciated!
An option could be:
library(tidyverse)
library(lubridate)
#> date, intersect, setdiff, union
df_1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L), date = c("14.04.2020",
"14.04.2020", "15.04.2020", "15.04.2020", "16.04.2020"), time = c("11:00:00",
"18:00:00", "10:00:00", "20:00:00", "11:00:00"), date_time = structure(c(1586862000,
1586887200, 1586944800, 1586980800, 1587034800), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA,
-5L))
df_2 <- structure(list(ID = c(1L, 1L), start = c("14.04.2020 16:00",
"15.04.2020 18:00"), end = c("14.04.2020 20:00", "16.04.2020 13:00"
)), class = "data.frame", row.names = c(NA, -2L))
df_22 <- df_2 %>%
mutate(across(c("start", "end"), dmy_hm)) %>%
group_nest(ID)
left_join(x = df_1, y = df_22, by = "ID") %>%
as_tibble() %>%
mutate(mark = map2_dbl(date_time, data, ~+any(.x %within% interval(.y$start, .y$end)))) %>%
select(-data)
#> # A tibble: 5 x 5
#> ID date time date_time mark
#> <int> <chr> <chr> <dttm> <dbl>
#> 1 1 14.04.2020 11:00:00 2020-04-14 11:00:00 0
#> 2 1 14.04.2020 18:00:00 2020-04-14 18:00:00 1
#> 3 1 15.04.2020 10:00:00 2020-04-15 10:00:00 0
#> 4 1 15.04.2020 20:00:00 2020-04-15 20:00:00 1
#> 5 1 16.04.2020 11:00:00 2020-04-16 11:00:00 1
Created on 2021-05-25 by the reprex package (v2.0.0)
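For comparison, the same check can be written without lubridate. This is a rough base R sketch rather than a drop-in replacement: it assumes the timestamps have already been parsed to POSIXct in UTC, and the data frames points and windows are simplified stand-ins for df_1 and df_2.

```r
# Simplified stand-ins for df_1 and df_2, already parsed as POSIXct (UTC).
points <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 1L),
  date_time = as.POSIXct(c("2020-04-14 11:00:00", "2020-04-14 18:00:00",
                           "2020-04-15 10:00:00", "2020-04-15 20:00:00",
                           "2020-04-16 11:00:00"), tz = "UTC")
)
windows <- data.frame(
  ID = c(1L, 1L),
  start = as.POSIXct(c("2020-04-14 16:00:00", "2020-04-15 18:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2020-04-14 20:00:00", "2020-04-16 13:00:00"), tz = "UTC")
)

# For each time point, look only at windows with the same ID and flag it
# with 1 if it lies inside at least one of them (bounds inclusive).
points$mark <- vapply(seq_len(nrow(points)), function(i) {
  w <- windows[windows$ID == points$ID[i], ]
  as.integer(any(points$date_time[i] >= w$start & points$date_time[i] <= w$end))
}, integer(1))
points
```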
I was wondering if someone here can help me with a lapply question.
Every month, data are extracted and the data frames are named according to the date extracted (01-08-2019,01-09-2019,01-10-2019 etc). The contents of each data frame are similar to the example below:
01-09-2019
ID DOB
3 01-07-2019
5 01-06-2019
7 01-05-2019
8 01-09-2019
01-10-2019
ID DOB
2 01-10-2019
5 01-06-2019
8 01-09-2019
9 01-02-2019
As the months roll on, there are more data sets being downloaded.
I am wanting to calculate the ages of people in each of the data sets based on the date the data was extracted - so in essence, the age would be the date difference between the data frame name and the DOB variable.
01-09-2019
ID DOB AGE(months)
3 01-07-2019 2
5 01-06-2019 3
7 01-05-2019 4
8 01-09-2019 0
01-10-2019
ID DOB AGE(months)
2 01-10-2019 0
5 01-06-2019 4
8 01-09-2019 1
9 01-02-2019 8
I was thinking of putting all of the data frames together in a list (as there are a lot) and then using lapply to calculate age across all data frames. How do I go about calculating the difference between a data frame name and a column?
If I may suggest a slightly different approach: it might make more sense to combine your list into a single data frame before calculating the ages. Suppose your data looks something like this, i.e. it is a list of data frames whose element names are the dates of extraction:
$`01-09-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 3 2019-07-01
2 5 2019-06-01
3 7 2019-05-01
4 8 2019-09-01
$`01-10-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 2 2019-10-01
2 5 2019-06-01
3 8 2019-09-01
4 9 2019-02-01
You can call bind_rows first with parameter .id = "date_extracted" to turn your list into a data frame, and then calculate age in months.
library(tidyverse)
library(lubridate)
tib <- bind_rows(tib_list, .id = "date_extracted") %>%
mutate(date_extracted = dmy(date_extracted),
DOB = dmy(DOB),
age_months = month(date_extracted) - month(DOB)
)
#### OUTPUT ####
# A tibble: 8 x 4
date_extracted ID DOB age_months
<date> <dbl> <date> <dbl>
1 2019-09-01 3 2019-07-01 2
2 2019-09-01 5 2019-06-01 3
3 2019-09-01 7 2019-05-01 4
4 2019-09-01 8 2019-09-01 0
5 2019-10-01 2 2019-10-01 0
6 2019-10-01 5 2019-06-01 4
7 2019-10-01 8 2019-09-01 1
8 2019-10-01 9 2019-02-01 8
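Note that subtracting month numbers only gives the right answer when both dates fall in the same calendar year, as they do in this sample. If the data can span a year boundary, the year has to enter the calculation too. A minimal base R sketch follows; months_between is a hypothetical helper name, not a function from any package, and it counts calendar months while ignoring the day of the month.

```r
# Hypothetical helper: whole calendar months from `from` to `to`.
months_between <- function(from, to) {
  f <- as.POSIXlt(from)
  t <- as.POSIXlt(to)
  # $year is years since 1900 and $mon is 0-11, so differences are safe.
  12 * (t$year - f$year) + (t$mon - f$mon)
}

# Same year: agrees with the simple month subtraction.
months_between(as.Date("2019-07-01"), as.Date("2019-09-01"))

# Across a year boundary: month subtraction alone would give 2 - 11 = -9.
months_between(as.Date("2018-11-01"), as.Date("2019-02-01"))
```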
This can be solved with lapply as well, but once all the data frames are in a list we can also use Map to iterate over the list and its names. In base R:
Map(function(x, y) {
  # DOB and the list names are day-month-year, so give the format explicitly
  x$DOB <- as.Date(x$DOB, format = "%d-%m-%Y")
  transform(x, age = as.integer(format(as.Date(y, format = "%d-%m-%Y"), "%m")) -
    as.integer(format(x$DOB, "%m")))
}, list_df, names(list_df))
#$`01-09-2019`
# ID DOB age
#1 3 2019-07-01 2
#2 5 2019-06-01 3
#3 7 2019-05-01 4
#4 8 2019-09-01 0
#$`01-10-2019`
# ID DOB age
#1 2 2019-10-01 0
#2 5 2019-06-01 4
#3 8 2019-09-01 1
#4 9 2019-02-01 8
We can also do the same in the tidyverse, parsing both the list names and DOB as day-month-year with dmy():
library(dplyr)
library(lubridate)
purrr::imap(list_df, ~.x %>% mutate(age = month(dmy(.y)) - month(dmy(as.character(DOB)))))
data
list_df <- list(`01-09-2019` = structure(list(ID = c(3L, 5L, 7L, 8L),
DOB = structure(c(3L, 2L, 1L, 4L), .Label = c("01-05-2019", "01-06-2019",
"01-07-2019", "01-09-2019"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L)), `01-10-2019` = structure(list(ID = c(2L, 5L, 8L, 9L),
DOB = structure(c(4L, 2L, 3L, 1L), .Label = c("01-02-2019",
"01-06-2019", "01-09-2019", "01-10-2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L)))
It's bad practice to use dates and numbers as data frame names. Consider prefixing the date with an "x", as shown in this base R solution:
df_list <- list(x01_09_2019 = `01-09-2019`, x01_10_2019 = `01-10-2019`)
df_list <- mapply(cbind, "report_date" = names(df_list), df_list, SIMPLIFY = F)
df_list <- lapply(df_list, function(x){
x$report_date <- as.Date(gsub("_", "-", gsub("x", "", x$report_date)), "%d-%m-%Y")
x$Age <- x$report_date - x$DOB
return(x)
}
)
Data:
`01-09-2019` <- structure(list(ID = c(3, 5, 7, 8),
DOB = structure(c(18078, 18048, 18017, 18140), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
`01-10-2019` <- structure(list(ID = c(2, 5, 8, 9),
DOB = structure(c(18170, 18048, 18140, 17928), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
I need to calculate the number of buyers in a store at each hour of the day. I have reproduced the data from a similar problem, but that one did not answer what I am looking for. I do not want to calculate the length of stay in the store; I want to calculate the occupancy of the store by counting all buyers present at each hour of the day. I need to do this with tidyverse and lubridate only.
df <- structure(list(ID = c(101, 102, 103, 104, 105, 106, 107),
Time_in = structure(c(1326309720, 1326309900, 1328990700,
1328997240, 1329000840, 1329004440,
1329004680),
class = c("POSIXct", "POSIXt"), tzone = ""),
Time_out = structure(c(1326313800, 1326317340, 1326317460,
1326324660, 1326328260, 1326335460,
1326335460),
class = c("POSIXct", "POSIXt"), tzone = "")), .Names =
c("ID", "Adm", "Disc"),
row.names = c(NA, -7L), class = "data.frame")
I am assuming Adm and Disc are actions the buyers perform in the shop.
Counting by year, month, day and hour means this approach scales to data covering any number of years.
df <- structure(list(ID = c(101, 102, 103, 104, 105, 106, 107),
Adm = structure(c(1326309720, 1326309900, 1328990700,
1328997240, 1329000840, 1329004440,
1329004680),
class = c("POSIXct", "POSIXt"), tzone = ""),
Disc = structure(c(1326313800, 1326317340, 1326317460,
1326324660, 1326328260, 1326335460,
1326335460),
class = c("POSIXct", "POSIXt"), tzone = "")), .Names =
c("ID", "Adm", "Disc"),
row.names = c(NA, -7L), class = "data.frame")
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
by_hours <- df %>%
gather(key = Type, Time, 2:3)
by_hours
#> ID Type Time
#> 1 101 Adm 2012-01-11 20:22:00
#> 2 102 Adm 2012-01-11 20:25:00
#> 3 103 Adm 2012-02-11 21:05:00
#> 4 104 Adm 2012-02-11 22:54:00
#> 5 105 Adm 2012-02-11 23:54:00
#> 6 106 Adm 2012-02-12 00:54:00
#> 7 107 Adm 2012-02-12 00:58:00
#> 8 101 Disc 2012-01-11 21:30:00
#> 9 102 Disc 2012-01-11 22:29:00
#> 10 103 Disc 2012-01-11 22:31:00
#> 11 104 Disc 2012-01-12 00:31:00
#> 12 105 Disc 2012-01-12 01:31:00
#> 13 106 Disc 2012-01-12 03:31:00
#> 14 107 Disc 2012-01-12 03:31:00
by_hours %>%
mutate(
Time = ymd_hms(Time),
year = year(Time),
month = month(Time),
day = day(Time),
hour = hour(Time),
) %>%
count(year, month, day, hour)
#> # A tibble: 10 x 5
#> year month day hour n
#> <dbl> <dbl> <int> <int> <int>
#> 1 2012 1 11 20 2
#> 2 2012 1 11 21 1
#> 3 2012 1 11 22 2
#> 4 2012 1 12 0 1
#> 5 2012 1 12 1 1
#> 6 2012 1 12 3 2
#> 7 2012 2 11 21 1
#> 8 2012 2 11 22 1
#> 9 2012 2 11 23 1
#> 10 2012 2 12 0 2
Created on 2018-07-17 by the reprex package (v0.2.0).
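The counts above are the number of entry and exit events per hour. If what is needed is occupancy, meaning everyone present at any point during an hour, one possible approach is to test each visit for overlap with each hourly slot. The sketch below is base R rather than tidyverse and uses a small made-up visits table with hypothetical column names time_in and time_out, since several rows of the sample data have exit times earlier than their entry times.

```r
# Hypothetical visits: entry and exit time per buyer (UTC for reproducibility).
visits <- data.frame(
  ID = 1:3,
  time_in  = as.POSIXct(c("2012-01-11 20:22:00", "2012-01-11 20:25:00",
                          "2012-01-11 21:05:00"), tz = "UTC"),
  time_out = as.POSIXct(c("2012-01-11 21:30:00", "2012-01-11 22:29:00",
                          "2012-01-11 22:31:00"), tz = "UTC")
)

# Hourly slots from the first entry to the last exit.
hour_start <- seq(as.POSIXct(trunc(min(visits$time_in), "hours")),
                  as.POSIXct(trunc(max(visits$time_out), "hours")),
                  by = "hour")

# A buyer occupies a slot if they entered before it ends (start + 3600 s)
# and left at or after it starts.
occupancy <- vapply(seq_along(hour_start), function(i) {
  sum(visits$time_in < hour_start[i] + 3600 & visits$time_out >= hour_start[i])
}, integer(1))

data.frame(hour_start, occupancy)
```

Whether a buyer who leaves exactly on the hour should count toward the next slot is a judgment call; here they do.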