Splitting multiple date and time variables & computing time average in R

I have the following dataset containing each person's ID, the district and sub-district they live in, and the last date/time on which they uploaded data to the server. The "last_upload_" variables hold the last date/time at which a person uploaded data, and their names record the date on which I downloaded that data from the server. For example, "last_upload_2020-06-12" means I downloaded the data from the server on 12th June.
For the dataset below, I would like to split the date and time in each of these variables (all at once) so that the new variables created are named "last_date_(my download date)" and "last_time_(my download date)".
district block id last_upload_2020-06-12 last_upload_2020-06-13 last_upload_2020-06-14 last_upload_2020-06-15
A X 11 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0
A X 12 2020-06-11 12:40:26.0 2020-06-11 12:40:26.0 2020-06-14 11:40:26.0 2020-06-15 18:50:26.0
A X 2020-06-14 11:08:12.0 2020-06-14 11:08:12.0
A X 14 2020-06-12 11:31:07.0 2020-06-13 11:31:07.0 2020-06-14 17:37:07.0 2020-06-14 17:37:07.0
A Y 15 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0
A Y 16 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0
A Y 17 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0
A Y 18 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0
A Z 19 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0
A Z 20 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0
A Z 21 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0
A Z 22 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0
A X 23 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0
A X 24 2020-06-10 10:28:59.0 2020-06-10 10:28:59.0 2020-06-10 10:28:59.0 2020-06-15 11:31:33.0
A Y 25
A Y 26 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0
B E 31
B C 32 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0
B C 33 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0
B C 34 2020-06-09 15:38:18.0 2020-06-09 15:38:18.0 2020-06-09 15:38:18.0 2020-06-15 14:35:41.0
B C 35 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0
B D 36 2020-06-12 11:53:15.0 2020-06-12 11:53:15.0 2020-06-12 11:53:15.0 2020-06-15 13:02:39.0
B D 37 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0
B D 38 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0
B E 39 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0
B E 40 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0
B E 41 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0
B C 42 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0
B D 43 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0
N.B: my date/time variables are numeric.
Once I get the data in shape, I would also like to do the following:
Get the year and month of all observations under "last_upload_2020-06-12" in a separate column.
Similarly, for the last date in my dataset, that is "last_upload_2020-06-15". Can I automate R's picking of the last date, using something like Sys.Date()-1? I will always have data up to one day before the current date.
Calculate the average upload time per ID, i.e., generally around what time does a person upload data to the server? Average should be based on unique time values.
Would be extremely helpful if someone could help solve this!
Thanks,
Rachita

You can try the following code on your original data set. It addresses the introductory part, then the first, the third and, lastly, the second part of the question.
library(lubridate)
library(tidyverse)
district <- c("A","A","B","B","C","C")
block <- c("X","Y","Z","X","Y","Z")
id <- c(11,11,12,12,13,13)
upload_dt <- ymd_hms(c("2020-06-13 11:31:07",
                       "2020-04-12 14:23:29",
                       "2020-04-30 13:51:20",
                       "2020-06-12 11:53:15",
                       "2019-09-28 02:08:51",
                       "2020-04-12 16:23:29"))
df <- data.frame(district, block, id, upload_dt)
df <- df %>%
  separate(upload_dt, into = c("date", "time"),
           sep = " ", remove = FALSE)
df$upload_date <- paste("last_upload_date_is", df$date)
df$upload_time <- paste("last_upload_time_is", df$time)
df <- df %>%
  mutate(date_added = ymd(date),
         year_upload = year(date),
         month_upload = month(date))
df
The output for the introductory and first part of the question is as follows:-
district block id upload_dt date time upload_date
1 A X 11 2020-06-13 11:31:07 2020-06-13 11:31:07 last_upload_date_is 2020-06-13
2 A Y 11 2020-04-12 14:23:29 2020-04-12 14:23:29 last_upload_date_is 2020-04-12
3 B Z 12 2020-04-30 13:51:20 2020-04-30 13:51:20 last_upload_date_is 2020-04-30
4 B X 12 2020-06-12 11:53:15 2020-06-12 11:53:15 last_upload_date_is 2020-06-12
5 C Y 13 2019-09-28 02:08:51 2019-09-28 02:08:51 last_upload_date_is 2019-09-28
6 C Z 13 2020-04-12 16:23:29 2020-04-12 16:23:29 last_upload_date_is 2020-04-12
upload_time date_added year_upload month_upload
1 last_upload_time_is 11:31:07 2020-06-13 2020 6
2 last_upload_time_is 14:23:29 2020-04-12 2020 4
3 last_upload_time_is 13:51:20 2020-04-30 2020 4
4 last_upload_time_is 11:53:15 2020-06-12 2020 6
5 last_upload_time_is 02:08:51 2019-09-28 2019 9
6 last_upload_time_is 16:23:29 2020-04-12 2020 4
The code and the output for the third part of the question is as follows:-
df %>%
  group_by(id) %>%
  summarise(avg_time_per_id = format(mean(strptime(time, "%H:%M:%S")), "%H:%M:%S")) %>%
  ungroup()
# A tibble: 3 x 2
id avg_time_per_id
<dbl> <chr>
1 11 12:57:18
2 12 12:52:17
3 13 09:16:10
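One detail from the question: the average should be based on unique time values. A hedged variation on the code above (a sketch using a small stand-in data frame with id and time columns) drops duplicates with distinct() before averaging:

```r
library(dplyr)

# stand-in data: id 11 has a duplicated time, id 12 has two distinct times
uploads <- data.frame(id   = c(11, 11, 12, 12),
                      time = c("11:31:07", "11:31:07", "13:51:20", "11:53:15"))

avg <- uploads %>%
  distinct(id, time) %>%   # keep only the unique time values per id
  group_by(id) %>%
  summarise(avg_time_per_id = format(mean(strptime(time, "%H:%M:%S")),
                                     "%H:%M:%S")) %>%
  ungroup()
avg
```

The duplicated 11:31:07 for id 11 is counted once, so it no longer pulls the average toward itself.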
The code and the output for the second part of the question is as follows:-
(Note: for this I have created a new data frame; you can apply the same solution to your existing data set.)
df <- data.frame(
  id = c(1:5),
  district = c("X","Y","X","Y","X"),
  block = c("A","A","B","B","C"),
  upload_date_a = paste0(rep("2020-06-13"), " ", rep("11:31:07")),
  upload_date_b = paste0(rep("2010-08-15"), " ", rep("02:45:27")),
  upload_date_c = paste0(rep("2000-10-30"), " ", rep("16:45:51")),
  stringsAsFactors = FALSE
)
col_ind <- grep(x = names(df), pattern = "upload_date", value = TRUE, ignore.case = TRUE)
cols_list <- lapply(seq_along(col_ind), function(x){
  q1 <- do.call(rbind, strsplit(df[[col_ind[[x]]]], split = " "))
  q2 <- data.frame(q1, stringsAsFactors = FALSE)
  i <- ncol(q2)
  colnames(q2) <- paste0(col_ind[[x]], c(1:i))
  return(q2)
})
df_new <- cbind(df[1:3], do.call(cbind, cols_list))
df_new
id district block upload_date_a1 upload_date_a2 upload_date_b1
1 1 X A 2020-06-13 11:31:07 2010-08-15
2 2 Y A 2020-06-13 11:31:07 2010-08-15
3 3 X B 2020-06-13 11:31:07 2010-08-15
4 4 Y B 2020-06-13 11:31:07 2010-08-15
5 5 X C 2020-06-13 11:31:07 2010-08-15
upload_date_b2 upload_date_c1 upload_date_c2
1 02:45:27 2000-10-30 16:45:51
2 02:45:27 2000-10-30 16:45:51
3 02:45:27 2000-10-30 16:45:51
4 02:45:27 2000-10-30 16:45:51
5 02:45:27 2000-10-30 16:45:51

The data frame looked so complicated that I thought it might be better to replicate it.
I then used a loop to take every column you wanted and separate it into the last_date and last_time columns as requested. Inside the loop, the temporary data frame is cbind-ed to a data frame built outside the loop, which consists of the columns that are not processed in the loop.
The result of this loop is the data frame as wanted (the colnames got a little long).
The key to the second task was to convert last_time to hours, then group and summarise.
I hope this is what you wanted.
I think with this as a basis you can deal with part 2.
There were some warnings, which had to do with NAs.
More explanation in the reprex below.
library(tidyverse)
df <- read.table(text = '
district block id last_upload_2020_06_12 last_upload_2020_06_13 last_upload_2020_06_14 last_upload_2020_06_15
"A" "X" 11 "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0"
"A" "X" 12 "2020-06-11 12:40:26.0" "2020-06-11 12:40:26.0" "2020-06-14 11:40:26.0" "2020-06-15 18:50:26.0"
"A" "X" NA "NA" "NA" "2020-06-14 11:08:12.0" "2020-06-14 11:08:12.0"
"A" "X" 14 "2020-06-12 11:31:07.0" "2020-06-13 11:31:07.0" "2020-06-14 17:37:07.0" "2020-06-14 17:37:07.0"
"A" "Y" 15 "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0"
"A" "Y" 16 "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0"
"A" "Y" 17 "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0"
"A" "Y" 18 "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0"
"A" "Z" 19 "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0"
"A" "Z" 20 "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0"
"A" "Z" 21 "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0"
"A" "Z" 22 "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0"
"A" "X" 23 "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0"
"A" "X" 24 "2020-06-10 10:28:59.0" "2020-06-10 10:28:59.0" "2020-06-10 10:28:59.0" "2020-06-15 11:31:33.0"
"A" "Y" 25 "" "" "" ""
"A" "Y" 26 "2020-05-30 12:14:09.0" "2020-05-30 12:14:09.0" "2020-05-30 12:14:09.0" "2020-05-30 12:14:09.0"
"B" "E" 31 "" "" "" ""
"B" "C" 32 "2020-06-12 16:43:23.0" "2020-06-12 16:43:23.0" "2020-06-12 16:43:23.0" "2020-06-12 16:43:23.0"
"B" "C" 33 "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0"
"B" "C" 34 "2020-06-09 15:38:18.0" "2020-06-09 15:38:18.0" "2020-06-09 15:38:18.0" "2020-06-15 14:35:41.0"
"B" "C" 35 "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0"
"B" "D" 36 "2020-06-12 11:53:15.0" "2020-06-12 11:53:15.0" "2020-06-12 11:53:15.0" "2020-06-15 13:02:39.0"
"B" "D" 37 "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0"
"B" "D" 38 "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0"
"B" "E" 39 "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0"
"B" "E" 40 "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0"
"B" "E" 41 "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0"
"B" "C" 42 "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0"
"B" "D" 43 "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0"
', header =T)
# TASK: create for each column which contains 'last_upload' new columns
# with date and time
# get the colnames of the cols to be split or separated
ccl <- colnames(df %>% select(last_upload_2020_06_12:last_upload_2020_06_15))
# create new DF with first 3 columns, to which the other columns are bound
# in the following loop
dff <- df %>% select(district:id)
# loop to separate each col in ccl into _date and _time
for (cl in ccl) {
  tmp <- separate(df,
                  col = cl, sep = " ",
                  into = c(paste0(cl, "_date"), paste0(cl, "_time"))) %>%
    select(contains("_date") | contains("_time"))
  dff <- cbind(dff, tmp)
}
dff %>% head()
#> district block id last_upload_2020_06_12_date last_upload_2020_06_12_time
#> 1 A X 11 2020-02-06 11:53:19.0
#> 2 A X 12 2020-06-11 12:40:26.0
#> 3 A X NA <NA> <NA>
#> 4 A X 14 2020-06-12 11:31:07.0
#> 5 A Y 15 2020-06-10 12:45:48.0
#> 6 A Y 16 2020-04-04 02:26:57.0
#> last_upload_2020_06_13_date last_upload_2020_06_13_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-11 12:40:26.0
#> 3 <NA> <NA>
#> 4 2020-06-13 11:31:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
#> last_upload_2020_06_14_date last_upload_2020_06_14_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-14 11:40:26.0
#> 3 2020-06-14 11:08:12.0
#> 4 2020-06-14 17:37:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
#> last_upload_2020_06_15_date last_upload_2020_06_15_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-15 18:50:26.0
#> 3 2020-06-14 11:08:12.0
#> 4 2020-06-14 17:37:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
# TASK: Calculate the average time of a day each id does a download
# new DF from original brought into long format
# split the date/time into last_date and last_time
ddf <- df %>%
  pivot_longer(cols = last_upload_2020_06_12:last_upload_2020_06_15) %>%
  separate(col = value, sep = ' ', into = c('last_date', 'last_time')) %>%
  mutate(last_date = lubridate::ymd(last_date),
         last_time = lubridate::hms(last_time))
# calculating the mean hour of the day at which each id does a
# download, by converting last_time to hours (of the day) and
# taking the mean hour after grouping
ddf %>%
  mutate(hours = as.numeric(last_time, unit = 'hour')) %>%
  group_by(id) %>%
  summarise(meanHourOfTheDay = mean(hours, na.rm = TRUE))
#> # A tibble: 29 x 2
#> id meanHourOfTheDay
#> <int> <dbl>
#> 1 11 11.9
#> 2 12 14.0
#> 3 14 14.6
#> 4 15 12.8
#> 5 16 2.45
#> 6 17 8.17
#> 7 18 12.1
#> 8 19 15.4
#> 9 20 17.7
#> 10 21 14.4
#> # … with 19 more rows
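The remaining part of the question, automatically picking the column for yesterday's download date, can be sketched as follows. This is an assumption-laden sketch: it assumes the columns keep the last_upload_YYYY_MM_DD naming used in this reprex, so the newest column name can be built directly from Sys.Date() - 1 (df2 and latest_col are illustrative names, not from the original data):

```r
library(dplyr)
library(lubridate)

# assumption: columns follow the last_upload_YYYY_MM_DD pattern used above
latest_col <- format(Sys.Date() - 1, "last_upload_%Y_%m_%d")

# toy data frame with one column named for yesterday's download date
df2 <- data.frame(id = c(11, 12))
df2[[latest_col]] <- c("2020-06-10 12:45:48.0", "2020-06-15 18:50:26.0")

# pull year and month out of that column, whatever its exact name is
df2 <- df2 %>%
  mutate(year_upload  = year(ymd_hms(.data[[latest_col]])),
         month_upload = month(ymd_hms(.data[[latest_col]])))
df2
```

The .data[[latest_col]] pronoun lets mutate() work with a column whose name is only known at run time.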

Related

R: Accessing first three elements of a split dataframe

For example,
dateIntervals <- as.Date(c("2020-08-10", "2020-11-11", "2021-07-05"))
possibleDates <- seq(as.Date("2020-01-02"), dateIntervals[3], by = "day")
genDF<-function() data.frame(Date = sample(possibleDates, 100), Value = runif(100))
listdf <-replicate(2, genDF(), simplify = FALSE)
Yes, listdf has two data frame elements (each with 100 random dates from possibleDates and their values),
and listdf[[1]] is like this
A data.frame: 100 × 2
Date Value
<date> <dbl>
2020-07-24 0.63482411
2020-02-26 0.25989280
2020-10-26 0.21721077
2020-10-11 0.34774192
2020-08-18 0.67758312
2020-02-03 0.22929624
2020-06-10 0.30279353
2020-05-29 0.95549488
...
lapply(listdf, function(x) split(x, findInterval(x$Date, dateIntervals)))
This makes listdf a 2×3 list, split by date.
1.$`0`
A data.frame: 43 × 2
Date Value
<date> <dbl>
1 2020-07-24 0.63482411
2 2020-02-26 0.25989280
6 2020-02-03 0.22929624
7 2020-06-10 0.30279353
...
$`1`
A data.frame: 15 × 2
Date Value
<date> <dbl>
3 2020-10-26 0.21721077
4 2020-10-11 0.34774192
5 2020-08-18 0.67758312
31 2020-11-09 0.59149301
...
$`2`
A data.frame: 42 × 2
Date Value
<date> <dbl>
9 2021-06-28 0.10055644
10 2021-05-17 0.63942936
12 2021-04-22 0.63589801
13 2021-02-01 0.70106156
...
2.$`0`
A data.frame: 43 × 2
Date Value
<date> <dbl>
2 2020-07-16 0.81376364
4 2020-07-03 0.05152627
7 2020-01-21 0.98677433
8 2020-03-23 0.13513921
...
$`1`
A data.frame: 18 × 2
Date Value
<date> <dbl>
5 2020-11-01 0.02740125
12 2020-09-04 0.82042568
15 2020-08-12 0.54190868
16 2020-09-19 0.05933666
18 2020-10-05 0.04983061
...
$`2`
A data.frame: 38 × 2
Date Value
<date> <dbl>
1 2021-04-13 0.46199245
3 2021-06-12 0.71461155
6 2021-01-24 0.56527997
9 2021-04-17 0.72634151
13 2021-04-20 0.55489499
...
I want only the first two of the split pieces ($'0' and $'1' for 1. and 2.).
Is there any parameter in the split function which does things like this (getting only the first or last n elements)?
I want something like this...
lapply(listdf, function(x) split(x, findInterval(x$Date, dateIntervals), some parameter=2))
...where the "2" means: keep only the first two pieces. Is there a function parameter in split which can do this?
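As far as I know, base split() has no argument for keeping only the first n groups, but since it returns an ordinary list, head() can truncate the result inside the same lapply(). A sketch using the setup above, with a seed added so it runs reproducibly:

```r
# reproducible version of the setup above
set.seed(42)
dateIntervals <- as.Date(c("2020-08-10", "2020-11-11", "2021-07-05"))
possibleDates <- seq(as.Date("2020-01-02"), dateIntervals[3], by = "day")
genDF <- function() data.frame(Date = sample(possibleDates, 100), Value = runif(100))
listdf <- replicate(2, genDF(), simplify = FALSE)

# split as before, then keep only the first two pieces of each result
first_two <- lapply(listdf, function(x) {
  head(split(x, findInterval(x$Date, dateIntervals)), 2)
})
```

Each element of first_two is then a list of just the $'0' and $'1' pieces; tail() would keep the last n instead.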

Find missing date range in R

I have a dataframe of start and end dates, where each row represents a specific trip.
Those date ranges makeup a continuous timeline except around April where there is a discontinuity/lack of data (because no trips were taken).
I would like to find the start and end date of that specific period? (using a tidy approach preferably)
library(tidyverse)
df <- data.frame(
  start = as.Date(c("2022-01-03", "2022-01-18", "2022-01-31", "2022-03-01", "2022-03-08",
                    "2022-03-09", "2022-04-15", "2022-04-20", "2022-04-20", "2022-05-03",
                    "2022-05-17", "2022-05-17", "2022-05-31", "2022-06-05", "2022-06-22",
                    "2022-06-28", "2022-07-11")),
  end = as.Date(c("2022-01-18", "2022-01-31", "2022-03-01", "2022-03-08", "2022-03-09",
                  "2022-03-25", "2022-04-20", "2022-04-20", "2022-05-03", "2022-05-17",
                  "2022-05-17", "2022-05-31", "2022-06-05", "2022-06-22", "2022-06-28",
                  "2022-07-11", "2022-07-17"))) %>%
  mutate(trip_number = as.character(row_number()))
df %>%
  ggplot() +
  geom_segment(aes(x = start, xend = end, y = 0, yend = 0, col = trip_number)) +
  theme(legend.position = "none")
Created on 2022-07-17 by the reprex package (v2.0.1)
A possible solution:
library(tidyverse)
library(lubridate)
df %>%
  mutate(date1 = if_else(start == lag(end), NA_Date_, lag(end)),
         date2 = if_else(start == lag(end), NA_Date_, start)) %>%
  bind_rows(tibble(start = .$date1, end = .$date2)) %>%
  filter(!if_all(everything(), is.na)) %>%
  arrange(start) %>%
  select(!starts_with("date"))
#> start end trip_number
#> 1 2022-01-03 2022-01-18 1
#> 2 2022-01-18 2022-01-31 2
#> 3 2022-01-31 2022-03-01 3
#> 4 2022-03-01 2022-03-08 4
#> 5 2022-03-08 2022-03-09 5
#> 6 2022-03-09 2022-03-25 6
#> 7 2022-03-25 2022-04-15 <NA>
#> 8 2022-04-15 2022-04-20 7
#> 9 2022-04-20 2022-04-20 8
#> 10 2022-04-20 2022-05-03 9
#> 11 2022-05-03 2022-05-17 10
#> 12 2022-05-17 2022-05-17 11
#> 13 2022-05-17 2022-05-31 12
#> 14 2022-05-31 2022-06-05 13
#> 15 2022-06-05 2022-06-22 14
#> 16 2022-06-22 2022-06-28 15
#> 17 2022-06-28 2022-07-11 16
#> 18 2022-07-11 2022-07-17 17
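An alternative sketch: after sorting by start, a gap is simply any row whose start is later than the previous row's end, so lag() can recover both ends of the missing range. Shown here on a trimmed, illustrative version of the trip data (trips and gaps are hypothetical names):

```r
library(dplyr)

# trimmed version of the trip data: continuous except between 03-25 and 04-15
trips <- data.frame(
  start = as.Date(c("2022-01-03", "2022-03-09", "2022-04-15", "2022-07-11")),
  end   = as.Date(c("2022-03-09", "2022-03-25", "2022-07-11", "2022-07-17")))

gaps <- trips %>%
  arrange(start) %>%
  mutate(prev_end = lag(end)) %>%
  filter(start > prev_end) %>%   # discontinuity: trip starts after the last one ended
  transmute(gap_start = prev_end, gap_end = start)
gaps
```

This returns only the gap rows rather than interleaving them with the trips, which may be all that is needed.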

How to merge characters of two columns?

I have a df1:
ID Date
23 4/5/2011
12 4/7/2012
14 6/17/2020
90 12/20/1994
Currently these columns are both character classes. I would like to create a third column that merges these two columns with an underscore. The output would look like:
ID Date Identifier
23 4/5/2011 23_4/5/2011
12 4/7/2012 12_4/7/2012
14 6/17/2020 14_6/17/2020
90 12/20/1994 90_12/20/1994
Use paste():
df1 <- data.frame(ID = c("23","13","14","90"), Date = c("4/5/2011", "4/7/2012", "6/17/2020", "12/20/1994"))
df1 |>
dplyr::mutate(Identifier = paste(ID, Date, sep = "_"))
#> ID Date Identifier
#> 1 23 4/5/2011 23_4/5/2011
#> 2 13 4/7/2012 13_4/7/2012
#> 3 14 6/17/2020 14_6/17/2020
#> 4 90 12/20/1994 90_12/20/1994
Created on 2022-03-06 by the reprex package (v2.0.1)
Using base R, you could do:
Reprex
your data
df <- read.table(text = "ID Date
23 4/5/2011
12 4/7/2012
14 6/17/2020
90 12/20/1994", header = TRUE)
Code
df$Identifier <- paste0(df$ID, "_", df$Date)
Output
df
#> ID Date Identifier
#> 1 23 4/5/2011 23_4/5/2011
#> 2 12 4/7/2012 12_4/7/2012
#> 3 14 6/17/2020 14_6/17/2020
#> 4 90 12/20/1994 90_12/20/1994
Created on 2022-03-06 by the reprex package (v2.0.1)
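Another option, shown as a sketch, is tidyr::unite(), which builds the combined column and (with remove = FALSE) keeps the source columns (dat is an illustrative name):

```r
library(tidyr)

dat <- data.frame(ID   = c("23", "12", "14", "90"),
                  Date = c("4/5/2011", "4/7/2012", "6/17/2020", "12/20/1994"))

# unite ID and Date into Identifier, keeping the source columns
dat <- unite(dat, "Identifier", ID, Date, sep = "_", remove = FALSE)
dat
```

Note that unite() places the new column at the position of the first united column rather than at the end.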

R: How to locate a column in a large dataframe by using information from another dataframe with less rows

I have a data frame (A) with a column containing some info, and a larger data frame (B) with a column that contains similar information. I need to detect which column of B contains the same data as the column in data frame A. Because data frame B is large, it would be time-consuming to look through it manually to identify the column. Is there a way to use the information from the column 'some_info' in data frame A to find the corresponding column in data frame B?
dataframeA <- data.frame(some_info = c("a","b","c","d","e"))
dataframeB <- data.frame(id = 1:8,
                         column_to_be_identified = c("a","f","b","c","g","d","h","e"),
                         column_almost_similar_but_not_quite = c("a","f","b","c","g","3","h","e"))
Basically: Is it possible to create a function or something similar that looks through dataframeB and detects the column(s) that contains exactly the information from the column in dataframeA?
Thanks a lot in advance!
If I understand correctly and you just want to receive the column name:
dataframeA <- data.frame(some_info = as.POSIXct(Sys.Date() - 1:5))
dataframeA
#> some_info
#> 1 2021-09-16 02:00:00
#> 2 2021-09-15 02:00:00
#> 3 2021-09-14 02:00:00
#> 4 2021-09-13 02:00:00
#> 5 2021-09-12 02:00:00
class(dataframeA$some_info)
#> [1] "POSIXct" "POSIXt"
dataframeB <- data.frame(id = 1:10,
                         column_to_be_identified = as.POSIXct(Sys.Date() - 1:10),
                         column_almost_similar_but_not_quite = as.POSIXct(Sys.Date() - 6:15))
dataframeB
#> id column_to_be_identified column_almost_similar_but_not_quite
#> 1 1 2021-09-16 02:00:00 2021-09-11 02:00:00
#> 2 2 2021-09-15 02:00:00 2021-09-10 02:00:00
#> 3 3 2021-09-14 02:00:00 2021-09-09 02:00:00
#> 4 4 2021-09-13 02:00:00 2021-09-08 02:00:00
#> 5 5 2021-09-12 02:00:00 2021-09-07 02:00:00
#> 6 6 2021-09-11 02:00:00 2021-09-06 02:00:00
#> 7 7 2021-09-10 02:00:00 2021-09-05 02:00:00
#> 8 8 2021-09-09 02:00:00 2021-09-04 02:00:00
#> 9 9 2021-09-08 02:00:00 2021-09-03 02:00:00
#> 10 10 2021-09-07 02:00:00 2021-09-02 02:00:00
relevant_column_name <- names(
  which(
    # iterate over all columns
    sapply(dataframeB, function(x) {
      # unique is more efficient for large vectors
      x <- unique(x)
      # are all values of the target vector in the column
      all(dataframeA$some_info %in% x)
    })))
relevant_column_name
#> [1] "column_to_be_identified"
With select() from dplyr we can do this:
library(dplyr)
dataframeB %>%
  select(where(~ is.character(.) &&
                 all(dataframeA$some_info %in% .))) %>%
  names()
[1] "column_to_be_identified"

Make datetime derived from one column

I want to create a new column datetime that contains the recorded date-times, derived from the path. The path column is formed like data/aklbus2017/2017-03-09-05-14.csv, and I need to make a dttm column that turns that path into 2017-03-09 05:14:00. How can I do it?
The path column looks like
#> # A tibble: 43,793 x 5
#> path delay stop.id stop.sequence route
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 data/aklbus2017/2017-… 150 5050 28 15401-201702141…
#> 2 data/aklbus2017/2017-… 97 3093 6 83401-201702141…
#> 3 data/aklbus2017/2017-… 50 4810 13 98105-201702141…
#> 4 data/aklbus2017/2017-… 58 6838 5 36201-201702141…
#> 5 data/aklbus2017/2017-… -186 2745 11 37301-201702141…
#> 6 data/aklbus2017/2017-… 183 2635 14 03301-201702141…
#> 7 data/aklbus2017/2017-… -144 3360 4 10001-201702141…
#> 8 data/aklbus2017/2017-… -151 2206 20 38011-201702141…
#> 9 data/aklbus2017/2017-… -46 2419 38 38011-201702141…
#> 10 data/aklbus2017/2017-… -513 6906 42 38012-201702141…
#> # … with 43,783 more rows
which i want is
#> # A tibble: 43,793 x 5
#> datetime delay stop.id stop.sequence route
#> <dttm> <dbl> <dbl> <dbl> <chr>
#> 1 2017-03-09 05:14:00 150 5050 28 15401
#> 2 2017-03-09 05:14:00 97 3093 6 83401
#> 3 2017-03-09 05:14:00 50 4810 13 98105
#> 4 2017-03-09 05:14:00 58 6838 5 36201
#> 5 2017-03-09 05:14:00 -186 2745 11 37301
#> 6 2017-03-09 05:14:00 183 2635 14 03301
#> 7 2017-03-09 05:14:00 -144 3360 4 10001
#> 8 2017-03-09 05:14:00 -151 2206 20 38011
#> 9 2017-03-09 05:14:00 -46 2419 38 38011
#> 10 2017-03-09 05:14:00 -513 6906 42 38012
#> # … with 43,783 more rows
We could use the parse_date_time function from lubridate after using str_sub from the stringr package.
# Example data
df <- tribble(
  ~path,
  "data/aklbus2017/2017-03-09-05-14.csv",
  "data/aklbus2017/2017-03-09-06-14.csv",
  "data/aklbus2017/2017-03-09-07-14.csv",
  "data/aklbus2017/2017-03-09-08-14.csv",
  "data/aklbus2017/2017-03-09-09-14.csv"
)
# The code:
library(tidyverse)
library(lubridate)
df %>%
mutate(datetime = parse_date_time(str_sub(path, start=17, end = 32), "ymd_hm"))
Output:
path datetime
<chr> <dttm>
1 data/aklbus2017/2017-03-09-05-14.csv 2017-03-09 05:14:00
2 data/aklbus2017/2017-03-09-06-14.csv 2017-03-09 06:14:00
3 data/aklbus2017/2017-03-09-07-14.csv 2017-03-09 07:14:00
4 data/aklbus2017/2017-03-09-08-14.csv 2017-03-09 08:14:00
5 data/aklbus2017/2017-03-09-09-14.csv 2017-03-09 09:14:00
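One caveat about the str_sub() approach: the fixed positions 17–32 assume the folder prefix is always exactly 16 characters. A sketch that tolerates other folder names extracts the date-time pattern with a regex instead (paths_df is an illustrative name, with a made-up second folder):

```r
library(tidyverse)
library(lubridate)

paths_df <- tibble(path = c("data/aklbus2017/2017-03-09-05-14.csv",
                            "data/other_dir/2018-12-01-23-59.csv"))

# grab the YYYY-MM-DD-HH-MM run wherever it sits in the path
paths_df <- paths_df %>%
  mutate(datetime = parse_date_time(
    str_extract(path, "\\d{4}-\\d{2}-\\d{2}-\\d{2}-\\d{2}"),
    "ymd_hm"))
paths_df
```

The regex anchors on the digit pattern itself, so the folder names can vary freely.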
Here's a way using basename and tools::file_path_sans_ext
library(tools)
df <- data.frame(path = c('data/abc/2017-03-09-05-14.csv', 'data/xyz/2017-03-10-05-14.csv'))
df$datetime <- as.POSIXct(tools::file_path_sans_ext(basename(df$path)),
                          format = "%Y-%m-%d-%H-%M", tz = 'UTC')
df
                           path            datetime
1 data/abc/2017-03-09-05-14.csv 2017-03-09 05:14:00
2 data/xyz/2017-03-10-05-14.csv 2017-03-10 05:14:00
For problems like these, it helps to read up on how to work with strings. There are many ways to break strings into parts, replace parts, and/or reassemble them differently.
Using the {tidyverse}, you can do the following.
library(tidyverse)
library(lubridate) # for time parsing - ymd_hms()
#-------- the data
df <- data.frame(path = "data/aklbus2017/2017-03-09-05-14.csv") %>%
  #--------- tidyr to break up path into new columns
  separate(col = path,
           into = c("folder", "sub-folder", "file"),
           sep = "/") %>%
  #----------- string operation
  mutate(dttm = str_remove(string = file, pattern = ".csv"),
         dttm2 = ymd_hm(dttm))
This gives you:
str(df)
'data.frame': 1 obs. of 5 variables:
$ folder : chr "data"
$ sub-folder: chr "aklbus2017"
$ file : chr "2017-03-09-05-14.csv"
$ dttm : chr "2017-03-09-05-14"
$ dttm2 : POSIXct, format: "2017-03-09 05:14:00"
There is no need to keep all columns, and you can combine the string operations into one mutate() call. I just put it this way to give you an idea of how to "step" through a series of operations to handle your string problem.
library(stringr)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
library(tidyverse)
library(chron)
datetime <-
c('data/aklbus2017/2017-03-09-05-14.csv',
'data/aklbus2017/2017-03-09-05-15.csv',
'data/aklbus2017/2017-03-09-05-16.csv')
date_separated <-
str_match(datetime, '2017/' %R% capture('.*') %R% '\\.csv$')[, 2] %>%
str_match(capture(one_or_more(DGT) %R% '-' %R% one_or_more(DGT) %R% '-' %R% one_or_more(DGT)) %R% '-' %R% capture('.*$')) %>%
`[`(, 2:3)
date_separated
#> [,1] [,2]
#> [1,] "2017-03-09" "05-14"
#> [2,] "2017-03-09" "05-15"
#> [3,] "2017-03-09" "05-16"
date_separated[, 2] <- date_separated[, 2] %>% str_replace('-', ':') %>% str_c(':00')
chron(dates=date_separated[,1],times=date_separated[,2],format=c('y-m-d','h:m:s')) %>% as.POSIXct() %>% tibble(datetime = .)
#> # A tibble: 3 x 1
#> datetime
#> <dttm>
#> 1 2017-03-09 02:14:00
#> 2 2017-03-09 02:15:00
#> 3 2017-03-09 02:16:00
#bind_cols(datetime, data)
Created on 2021-06-06 by the reprex package (v2.0.0)
