I have a file which has messages between customers and agents, but these messages are not grouped into conversations, i.e. there is no unique conversation ID. Luckily, the original message is included in each subsequent reply to that message. The message is in the 'text' column. This is easily explained by the example below:
actionDateTime     text       response          postTime
2019-01-01 12:00   Hi         N/A               2019-01-01 12:00
2019-01-01 12:01   Hi         Hello!            2019-01-01 12:00
2019-01-01 12:02   Hi         How can I help?   2019-01-01 12:00
...
2019-01-02 12:00   Hi there   N/A               2019-01-01 12:00
2019-01-02 12:01   Hi there   Morning           2019-01-01 12:00
2019-01-02 12:02   Hi there   How can I help?   2019-01-01 12:00
So I tried the code below to group them, but it isn't working:
df %>%
group_by(text, postTime) %>%
mutate(convID = row_number()) %>%
ungroup()
This does output a file with a convID, but not the way I want; in fact, I don't understand how it is numbering the rows. I believe that's because I'm using two variables in group_by. However, using only one will not work, as two different people can message at the same time, and two different messages can look identical (e.g. a lot of people start with just 'Hi').
When I tried grouping by 'text' only, it still gives me numbers within a conversation rather than a unique ID. Again, explained below.
What I get
text       response          postTime           convID
Hi         N/A               2019-01-01 12:00   1
Hi         Hello!            2019-01-01 12:00   2
Hi         How can I help?   2019-01-01 12:00   3
...
Hi there   N/A               2019-01-01 12:00   1
Hi there   Morning           2019-01-01 12:00   2
Hi there   How can I help?   2019-01-01 12:00   3
What I want:
text       response          postTime           convID
Hi         N/A               2019-01-01 12:00   1
Hi         Hello!            2019-01-01 12:00   1
Hi         How can I help?   2019-01-01 12:00   1
...
Hi there   N/A               2019-01-01 12:00   2
Hi there   Morning           2019-01-01 12:00   2
Hi there   How can I help?   2019-01-01 12:00   2
Any help?
We may need group_indices
library(dplyr)
df %>%
mutate(convID = group_indices(., text, postTime))
# actionDateTime text response postTime convID
#1 2019-01-01 12:00 Hi N/A 2019-01-01 12:00 1
#2 2019-01-01 12:01 Hi Hello! 2019-01-01 12:00 1
#3 2019-01-01 12:02 Hi How can I help? 2019-01-01 12:00 1
#4 2019-01-02 12:00 Hi there N/A 2019-01-01 12:00 2
#5 2019-01-02 12:01 Hi there Morning 2019-01-01 12:00 2
#6 2019-01-02 12:02 Hi there How can I help? 2019-01-01 12:00 2
data
df <- structure(list(actionDateTime = c("2019-01-01 12:00", "2019-01-01 12:01",
"2019-01-01 12:02", "2019-01-02 12:00", "2019-01-02 12:01", "2019-01-02 12:02"
), text = c("Hi", "Hi", "Hi", "Hi there", "Hi there", "Hi there"
), response = c("N/A", "Hello!", "How can I help?", "N/A", "Morning",
"How can I help?"), postTime = c("2019-01-01 12:00", "2019-01-01 12:00",
"2019-01-01 12:00", "2019-01-01 12:00", "2019-01-01 12:00", "2019-01-01 12:00"
)), class = "data.frame", row.names = c(NA, -6L))
Related
I have a large dataset with 516 rows (partial dataset below),
Check_In              Ward_1                Elapsed_time
2019-01-01 00:05:18   2019-01-01 00:09:32   4.2333333 mins
2019-01-01 00:11:3    2019-01-01 00:25:04   13.4500000 mins
2019-01-01 00:21:33   2019-01-01 01:03:31   41.9666667 mins
2019-01-01 00:27:18   2019-01-01 01:15:36   48.3000000 mins
2019-01-01 01:44:07   2019-01-01 02:02:45   18.6333333 mins
2019-01-01 02:10:46   2019-01-01 02:26:18   15.5333333 mins
I would like to create a subgroup number column with 3 rows per subgroup (example below) so I can then use the qcc.groups function with the Elapsed_time and subgroup columns.
Check_In              Ward_1                Elapsed_time      subgroup
2019-01-01 00:05:18   2019-01-01 00:09:32   4.2333333 mins    1
2019-01-01 00:11:3    2019-01-01 00:25:04   13.4500000 mins   1
2019-01-01 00:21:33   2019-01-01 01:03:31   41.9666667 mins   1
2019-01-01 00:27:18   2019-01-01 01:15:36   48.3000000 mins   2
2019-01-01 01:44:07   2019-01-01 02:02:45   18.6333333 mins   2
2019-01-01 02:10:46   2019-01-01 02:26:18   15.5333333 mins   2
Another base R option
df$subgroup <- ceiling(seq(nrow(df)) / 3)
We can use gl from base R to create the groups, specifying n as the number of rows of the dataset (nrow(df1)) and k = 3
df1$subgroup <- as.integer(gl(nrow(df1), 3, nrow(df1)))
data
df1 <- structure(list(Check_In = c("2019-01-01 00:05:18", "2019-01-01 00:11:3",
"2019-01-01 00:21:33", "2019-01-01 00:27:18", "2019-01-01 01:44:07",
"2019-01-01 02:10:46"), Ward_1 = c("2019-01-01 00:09:32", "2019-01-01 00:25:04",
"2019-01-01 01:03:31", "2019-01-01 01:15:36", "2019-01-01 02:02:45",
"2019-01-01 02:26:18"), Elapsed_time = c("4.2333333 mins", "13.4500000 mins",
"41.9666667 mins", "48.3000000 mins", "18.6333333 mins", "15.5333333 mins"
)), class = "data.frame", row.names = c(NA, -6L))
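For reference, gl(n, k, length) builds a factor whose levels 1..n are repeated in blocks of k, truncated (or recycled) to length; a minimal sketch:

```r
# Each of 2 levels repeated 3 times
g_small <- as.integer(gl(2, 3))          # 1 1 1 2 2 2
# For the 516-row dataset in the question, this yields 172 subgroups of size 3
g_full <- as.integer(gl(516, 3, 516))
```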
Or simply
df1 %>% mutate(grp = (row_number() +2) %/% 3)
Check_In Ward_1 Elapsed_time grp
1 2019-01-01 00:05:18 2019-01-01 00:09:32 4.2333333 mins 1
2 2019-01-01 00:11:3 2019-01-01 00:25:04 13.4500000 mins 1
3 2019-01-01 00:21:33 2019-01-01 01:03:31 41.9666667 mins 1
4 2019-01-01 00:27:18 2019-01-01 01:15:36 48.3000000 mins 2
5 2019-01-01 01:44:07 2019-01-01 02:02:45 18.6333333 mins 2
6 2019-01-01 02:10:46 2019-01-01 02:26:18 15.5333333 mins 2
df1 dput courtesy of @akrun
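The integer-division trick generalizes to any subgroup size k as (row_number() + k - 1) %/% k; a minimal base R sketch:

```r
k <- 3
# For rows 1..6, adding k - 1 before integer division yields 1 1 1 2 2 2
subgroup <- (seq_len(6) + k - 1) %/% k
subgroup
# 1 1 1 2 2 2
```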
Or maybe (thanks to akrun for the data):
library(dplyr)
df1 %>%
mutate(subgroup = rep(row_number(), each=3, length.out = n()))
Output:
Check_In Ward_1 Elapsed_time subgroup
1 2019-01-01 00:05:18 2019-01-01 00:09:32 4.2333333 mins 1
2 2019-01-01 00:11:3 2019-01-01 00:25:04 13.4500000 mins 1
3 2019-01-01 00:21:33 2019-01-01 01:03:31 41.9666667 mins 1
4 2019-01-01 00:27:18 2019-01-01 01:15:36 48.3000000 mins 2
5 2019-01-01 01:44:07 2019-01-01 02:02:45 18.6333333 mins 2
6 2019-01-01 02:10:46 2019-01-01 02:26:18 15.5333333 mins 2
I have two columns in a data frame: the first is water consumption and the second is date + hour. For example:
Value Time
12.2 1/1/2016 1:00
11.2 1/1/2016 2:00
10.2 1/1/2016 3:00
The data is for 4 years, and I want to create separate columns for month, day of month, year, and hour.
I would appreciate any help.
We can convert to a Datetime class and then extract the components. We assume the format of the 'Time' column is 'dd/mm/yyyy H:M' (in case it is different, i.e. 'mm/dd/yyyy H:M', change dmy_hm to mdy_hm):
library(dplyr)
library(lubridate)
df1 %>%
mutate(Time = dmy_hm(Time), month = month(Time),
year = year(Time), hour = hour(Time))
# Value Time month year hour
#1 12.2 2016-01-01 01:00:00 1 2016 1
#2 11.2 2016-01-01 02:00:00 1 2016 2
#3 10.2 2016-01-01 03:00:00 1 2016 3
In base R, we can use either strptime or as.POSIXct and then use format or extract the components:
df1$Time <- strptime(df1$Time, "%d/%m/%Y %H:%M")
transform(df1, month = Time$mon+1, year = Time$year + 1900, hour = Time$hour)
# Value Time month year hour
#1 12.2 2016-01-01 01:00:00 1 2016 1
#2 11.2 2016-01-01 02:00:00 1 2016 2
#3 10.2 2016-01-01 03:00:00 1 2016 3
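The question also asks for the day of the month; it can be pulled out the same way (a base R sketch, assuming the same 'dd/mm/yyyy H:M' format; with lubridate, day(Time) is the equivalent):

```r
df1 <- data.frame(Value = c(12.2, 11.2, 10.2),
                  Time = c("1/1/2016 1:00", "1/1/2016 2:00", "1/1/2016 3:00"))
tm <- strptime(df1$Time, "%d/%m/%Y %H:%M")
df1$day <- tm$mday   # day of month (1-31); or: as.integer(format(tm, "%d"))
```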
data
df1 <- structure(list(Value = c(12.2, 11.2, 10.2), Time = c("1/1/2016 1:00",
"1/1/2016 2:00", "1/1/2016 3:00")), class = "data.frame", row.names = c(NA,
-3L))
OK, this is making me crazy.
I have several datasets with time values that need to be rolled up into 15 minute intervals.
I found a solution here that works beautifully on one dataset, but on the next one I try I'm getting weird results. I have a column with character data representing dates:
BeginTime
-------------------------------
1 1/3/19 1:50 PM
2 1/3/19 1:30 PM
3 1/3/19 4:56 PM
4 1/4/19 11:23 AM
5 1/6/19 7:45 PM
6 1/7/19 10:15 PM
7 1/8/19 12:02 PM
8 1/9/19 10:43 PM
And I'm using the following code (which is exactly what I used on the other dataset, except for the names):
df$by15 = cut(mdy_hm(df$BeginTime), breaks="15 min")
but what I get is:
BeginTime by15
-------------------------------------------------------
1 1/3/19 1:50 PM 2019-01-03 13:36:00
2 1/3/19 1:30 PM 2019-01-03 13:21:00
3 1/3/19 4:56 PM 2019-01-03 16:51:00
4 1/4/19 11:23 AM 2019-01-04 11:21:00
5 1/6/19 7:45 PM 2019-01-06 19:36:00
6 1/7/19 10:15 PM 2019-01-07 22:06:00
7 1/8/19 12:02 PM 2019-01-08 11:51:00
8 1/9/19 10:43 PM 2019-01-09 22:36:00
9 1/10/19 11:25 AM 2019-01-10 11:21:00
Any suggestions on why I'm getting such random times instead of the 15-minute intervals I'm looking for? Like I said, this worked fine on the other data set.
You can use the lubridate::round_date() function, which will roll up your datetime data as follows:
library(lubridate) # To handle datetime data
library(dplyr) # For data manipulation
# Creating dataframe
df <-
data.frame(
BeginTime = c("1/3/19 1:50 PM", "1/3/19 1:30 PM", "1/3/19 4:56 PM",
"1/4/19 11:23 AM", "1/6/19 7:45 PM", "1/7/19 10:15 PM",
"1/8/19 12:02 PM", "1/9/19 10:43 PM")
)
df %>%
# First we parse the data in order to convert it from string format to datetime
mutate(by15 = parse_date_time(BeginTime, '%m/%d/%y %I:%M %p'),
# We roll up the data / round it to the nearest 15-minute interval
by15 = round_date(by15, "15 mins"))
#
#       BeginTime                by15
#  1/3/19 1:50 PM 2019-01-03 13:45:00
#  1/3/19 1:30 PM 2019-01-03 13:30:00
#  1/3/19 4:56 PM 2019-01-03 17:00:00
# 1/4/19 11:23 AM 2019-01-04 11:30:00
#  1/6/19 7:45 PM 2019-01-06 19:45:00
# 1/7/19 10:15 PM 2019-01-07 22:15:00
# 1/8/19 12:02 PM 2019-01-08 12:00:00
# 1/9/19 10:43 PM 2019-01-09 22:45:00
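As for why cut() gave the odd-looking bins in the first place: cut.POSIXt with breaks = "15 min" anchors the sequence of breaks at the earliest timestamp in the data (truncated to the minute), not at clock-aligned quarter hours, so unless the first value happens to fall on one, every bin is offset. A base R sketch of clock-aligned flooring that avoids this, without lubridate:

```r
x <- as.POSIXct("2019-01-03 13:50:00", tz = "UTC")
# Floor to the enclosing 15-minute (900-second) clock boundary
floor15 <- as.POSIXct(floor(as.numeric(x) / 900) * 900,
                      origin = "1970-01-01", tz = "UTC")
format(floor15, "%H:%M")  # "13:45"
```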
I am still learning R and am having trouble trying to merge two datasets from two different data.tables and match rows within a time interval. For example, given table1_schedule and table2_watch:
table1_schedule
Channel Program program_Date start_time
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
table2_watch
Person Channel program_Date start_time end_time
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
The goal is to merge in the programs that run between the "start_time" and "end_time" of table2_watch, adding the programs watched by the person during that time interval. For example,
The wanted output
Person Channel program_Date start_time end_time Prog1 Prog2 Prog3
Name A NatGeo 1/1/2018 11:00 12:00 Doc A Doc B Doc C
Name B NatGeo 1/1/2018 12:30 14:00 Doc C Doc D -NA-
Name B HBO 1/1/2018 21:30 22:00 Mov A -NA- -NA-
Name B HBO 1/1/2018 22:30 23:30 Mov A Mov B -NA-
Is there a way to do this in the simplest and most efficient way, such as with dplyr or whatever R commands suit this type of problem best? And the watched programs during the time interval should only be added if the watching goes beyond 10 minutes into the next program. Thanks
Here is a data.table solution where we can make use of foverlaps.
I'm showing every step with a short comment, to hopefully help with understanding.
library(data.table)
# Convert date & time to POSIXct
# Note that foverlap requires a start and end date, so we create an end date
# from the next start date per channel using shift for df1
setDT(df1)[, `:=`(
time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
time2 = as.POSIXct(paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])), format = "%d/%m/%Y %H:%M")), by = Channel]
setDT(df2)[, `:=`(
start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]
# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]
# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]
# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
# Channel Person start end Prog1 Prog2 Prog3
#1: NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2: NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D NA
#3: HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A NA NA
#4: HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B NA
Sample data
df1 <- read.table(text =
"Channel Program program_Date start_time
HBO 'Mov A' 1/1/2018 21:00
HBO 'Mov B' 1/1/2018 23:00
HBO 'Mov C' 1/1/2018 23:59
NatGeo 'Doc A' 1/1/2018 11:00
NatGeo 'Doc B' 1/1/2018 11:30
NatGeo 'Doc C' 1/1/2018 12:00
NatGeo 'Doc D' 1/1/2018 14:00", header = T)
df2 <- read.table(text =
"Person Channel program_Date start_time end_time
'Name A' NatGeo 1/1/2018 11:00 12:00
'Name B' NatGeo 1/1/2018 12:30 14:00
'Name B' HBO 1/1/2018 21:30 22:00
'Name B' HBO 1/1/2018 22:30 23:30", header = T)
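Regarding the 10-minute condition mentioned in the question: once watch and program intervals are joined (e.g. by foverlaps above), the length of each overlap can be computed and rows with less than 10 minutes of overlap dropped. A base R sketch with a hypothetical helper (overlap_mins is not from any package):

```r
# Minutes of overlap between a watch interval and a program interval
overlap_mins <- function(watch_start, watch_end, prog_start, prog_end) {
  as.numeric(difftime(pmin(watch_end, prog_end),
                      pmax(watch_start, prog_start), units = "mins"))
}

# Name B watched HBO 21:30-22:00; Mov A ran 21:00-23:00 -> 30 minutes, keep
ov <- overlap_mins(as.POSIXct("2018-01-01 21:30", tz = "UTC"),
                   as.POSIXct("2018-01-01 22:00", tz = "UTC"),
                   as.POSIXct("2018-01-01 21:00", tz = "UTC"),
                   as.POSIXct("2018-01-01 23:00", tz = "UTC"))
ov >= 10
```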
Here is how I would go about doing this. Note that I renamed some of your columns.
> cat schedule
Channel Program Date StartTime
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
> cat watch
Person Channel Date StartTime EndTime
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
Now, make sure we read these correctly using readr. In other words, specify the correct formats for the dates and the times.
library(dplyr)
library(readr)
library(lubridate)
schedule <- read_table("schedule",
col_types=cols_only(Channel=col_character(),
Program=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M")))
watch <- read_table("watch",
col_types=cols_only(Person=col_character(),
Channel=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M"),
EndTime=col_time("%H:%M")))
Next, we convert all dates and times to datetimes and add an ending datetime to the schedule.
schedule <- schedule %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
group_by(Channel) %>%
mutate(EndDateTime=lead(StartDateTime, default=as_datetime(Inf))) %>%
ungroup() %>%
select(Channel, Program, StartDateTime, EndDateTime)
watch <- watch %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
mutate(EndDateTime=ymd_hms(paste(Date, EndTime))) %>%
select(Person, Channel, StartDateTime, EndDateTime)
We can perform a join and check whether the watch and schedule intervals overlap (you can modify this to accommodate your 10-minute comment, I believe, although I did not fully understand what you meant).
watch %>%
inner_join(schedule,
by=c("Channel" = "Channel"),
suffix=c(".Watch", ".Schedule")) %>%
filter(int_overlaps(interval(StartDateTime.Watch, EndDateTime.Watch),
interval(StartDateTime.Schedule, EndDateTime.Schedule))) %>%
select(Person, Channel, Program, StartDateTime.Watch, EndDateTime.Watch) %>%
rename_at(.vars=vars(ends_with(".Watch")),
.funs=funs(sub("\\.Watch$", "", .)))
# A tibble: 8 x 5
Person Channel Program StartDateTime EndDateTime
<chr> <chr> <chr> <dttm> <dttm>
1 Name A NatGeo Doc A 2018-01-01 11:00:00 2018-01-01 12:00:00
2 Name A NatGeo Doc B 2018-01-01 11:00:00 2018-01-01 12:00:00
3 Name A NatGeo Doc C 2018-01-01 11:00:00 2018-01-01 12:00:00
4 Name B NatGeo Doc C 2018-01-01 12:30:00 2018-01-01 14:00:00
5 Name B NatGeo Doc D 2018-01-01 12:30:00 2018-01-01 14:00:00
6 Name B HBO Mov A 2018-01-01 21:30:00 2018-01-01 22:00:00
7 Name B HBO Mov A 2018-01-01 22:30:00 2018-01-01 23:30:00
8 Name B HBO Mov B 2018-01-01 22:30:00 2018-01-01 23:30:00
To get the desired output, you would have to group by everything except Program and "explode" the resulting groups into multiple columns. However, I am not sure if that is a good idea so I did not do it.
I have some data, and the Date column includes the time too. I am trying to get this data into xts format. I have tried the code below, but I get an error. Can anyone see anything wrong with this code? TIA
Date Open High Low Close
1 2017.01.30 07:00 1.25735 1.25761 1.25680 1.25698
2 2017.01.30 08:00 1.25697 1.25702 1.25615 1.25619
3 2017.01.30 09:00 1.25618 1.25669 1.25512 1.25533
4 2017.01.30 10:00 1.25536 1.25571 1.25093 1.25105
5 2017.01.30 11:00 1.25104 1.25301 1.25093 1.25262
6 2017.01.30 12:00 1.25260 1.25479 1.25229 1.25361
7 2017.01.30 13:00 1.25362 1.25417 1.25096 1.25177
8 2017.01.30 14:00 1.25177 1.25219 1.24900 1.25071
9 2017.01.30 15:00 1.25070 1.25307 1.24991 1.25238
10 2017.01.30 16:00 1.25238 1.25358 1.25075 1.25159
df = read.table(file = "GBPUSD60.csv", sep="," , header = TRUE)
dates = as.character(df$Date)
df$Date = NULL
Sept17 = xts(df, as.POSIXct(dates, format="%Y-%m-%d %H:%M"))
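The likely culprit: the dates use dots ("2017.01.30 07:00") while the format string uses dashes ("%Y-%m-%d %H:%M"), so as.POSIXct returns NA for every row and xts() then fails on the all-NA index. A sketch of the fix, matching the format to the data:

```r
dates <- c("2017.01.30 07:00", "2017.01.30 08:00")
# Dots in the format string to match the dots in the data
parsed <- as.POSIXct(dates, format = "%Y.%m.%d %H:%M", tz = "UTC")
parsed  # valid datetimes, no NAs
# The xts call from the question would then be:
# Sept17 <- xts::xts(df, order.by = parsed)
```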