I have this sample data frame, which keeps track of when a lamp is switched on and off.
time lamp status
1 2015-01-01 12:18:17 2 ON
2 2015-01-01 13:07:29 28 ON
3 2015-01-01 13:11:50 28 OFF
4 2015-01-01 13:18:28 2 OFF
5 2015-01-01 14:07:29 28 ON
6 2015-01-01 14:11:35 28 OFF
7 2015-01-01 14:18:28 2 ON
8 2015-01-01 14:18:57 2 OFF
What I want to achieve is to add a fourth column, containing the duration of a period where a lamp has been switched on (in seconds).
The desired output:
time lamp status duration
1 2015-01-01 12:18:17 2 ON 3611
2 2015-01-01 13:07:29 28 ON 261
3 2015-01-01 13:11:50 28 OFF NA
4 2015-01-01 13:18:28 2 OFF NA
5 2015-01-01 14:07:29 28 ON 246
6 2015-01-01 14:11:35 28 OFF NA
7 2015-01-01 14:18:28 2 ON 29
8 2015-01-01 14:18:57 2 OFF NA
I already succeeded in doing this with a custom function, involving while and for-loops. BUT...
I'm a beginner in R, and I'm pretty sure this can be done more simply and elegantly (using subsets, apply, and/or ...). I just can't figure out how.
Any ideas, or leads in the right direction?
This works for me:
library(dplyr)
df <- df %>%
  mutate(sec = as.numeric(time)) %>%
  group_by(lamp) %>%
  mutate(duration = c(diff(sec), NA)) %>%
  select(-sec)
df$duration[df$status=="OFF"] <- NA
#### 1 2015-01-01 12:18:17 2 ON 3611
#### 2 2015-01-01 13:07:29 28 ON 261
#### 3 2015-01-01 13:11:50 28 OFF NA
Your data:
df=structure(list(time = structure(c(1420111097, 1420114049, 1420114310,
1420114708, 1420117649, 1420117895, 1420118308, 1420118337), class = c("POSIXct",
"POSIXt"), tzone = ""), lamp = c(2L, 28L, 28L, 2L, 28L, 28L,
2L, 2L), status = structure(c(2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("OFF",
"ON"), class = "factor"), duration = c(2952, 261, NA, NA, 246,
NA, 29, NA)), .Names = c("time", "lamp", "status", "duration"
), row.names = c(NA, -8L), class = "data.frame")
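For comparison, the same idea can be sketched in base R without dplyr. This is only a sketch under assumptions: the data frame name `lamps` is made up for illustration, the times are rebuilt from the epoch seconds in the dput output above, and UTC is assumed as the time zone. `ave()` applies a function per lamp and puts the results back in the original row order.

```r
# Lamp events rebuilt from the epoch seconds in the dput output (UTC assumed)
lamps <- data.frame(
  time = as.POSIXct(c(1420111097, 1420114049, 1420114310, 1420114708,
                      1420117649, 1420117895, 1420118308, 1420118337),
                    origin = "1970-01-01", tz = "UTC"),
  lamp = c(2L, 28L, 28L, 2L, 28L, 28L, 2L, 2L),
  status = c("ON", "ON", "OFF", "OFF", "ON", "OFF", "ON", "OFF"),
  stringsAsFactors = FALSE
)

# Seconds until the next event of the same lamp (NA for a lamp's last event)
gap <- ave(as.numeric(lamps$time), lamps$lamp,
           FUN = function(s) c(diff(s), NA))

# Only ON rows carry a duration; OFF rows are set to NA
lamps$duration <- ifelse(lamps$status == "ON", gap, NA)
lamps
```

Like the dplyr version, this assumes that the rows are sorted by time and that each ON event is followed by the matching OFF event for the same lamp.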
Given a time series of cinema data, the "date" column is of interest. I would like to convert it into the format "YYYY/MM/DD". However, when I run my code:
CINEMA.TICKET$DATE <- as.Date(CINEMA.TICKET$date , format = "%y/%m/%d")
Two issues occur. First, the dates are shown on the far right of the table as, e.g., "0005-05-20". Second, many entries disappear entirely. Can someone explain what I am doing wrong and how to do it properly?
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day newdate DATE
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 5/5/2018 5 2 5 0005-05-20 2005-05-20
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 5/5/2018 5 2 5 0005-05-20 2005-05-20
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 5/5/2018 5 2 5 0005-05-20 2005-05-20
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 5/5/2018 5 2 5 0005-05-20 2005-05-20
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 5/5/2018 5 2 5 0005-05-20 2005-05-20
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 5/5/2018 5 2 5 0005-05-20 2005-05-20
> str(CINEMA.TICKET)
As @Dave2e pointed out, you are looking for:
CINEMA.TICKET[, date := as.Date(date , format = "%d/%m/%Y")]
This assumes the input format is day/month/year, as in "30/5/2018". The question is not clear on this, because its example "5/5/2018" could be either "%d/%m/%Y" or "%m/%d/%Y".
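The format ambiguity can be checked directly with base R's `as.Date()`. The snippet below is a small sketch that uses the literal string from the question plus one made-up value ("30/5/2018") to tell the two formats apart; it does not touch the real CINEMA.TICKET data.

```r
# The literal string from the question's date column
x <- "5/5/2018"

# Four-digit years need %Y; %y only reads a two-digit year, which is
# consistent with the 2005-05-20 values shown in the question's DATE column
month_first <- as.Date(x, format = "%m/%d/%Y")
day_first   <- as.Date(x, format = "%d/%m/%Y")

# Both readings agree here only because day and month are both 5
month_first == day_first

# A value with a day above 12 separates the two formats: the wrong one
# fails to parse and returns NA
as.Date("30/5/2018", format = "%d/%m/%Y")
is.na(as.Date("30/5/2018", format = "%m/%d/%Y"))
```

Values that fail to parse become NA, which is one plausible reason why entries seem to disappear when the format string does not match the data.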
As for ordering columns, use data.table's setcolorder:
setcolorder(CINEMA.TICKET, c("c", "b", "a"))
where c,b,a are column names in their desired order
lubridate probably does the trick
> lubridate::mdy("5/5/2018")
[1] "2018-05-05"
So you should use
library(lubridate)
library(tidyverse)
CINEMA.TICKET <- CINEMA.TICKET %>%
mutate(DATE=mdy(date))
Here is another option:
library(tidyverse)
output <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y"))
Output
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 2018-05-05 5 2 5
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 2018-05-05 5 2 5
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 2018-05-05 5 2 5
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 2018-05-05 5 2 5
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 2018-05-05 5 2 5
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 2018-05-05 5 2 5
A Date object always prints as YYYY-MM-DD, so the column cannot show forward slashes while it is still of class Date. You can change the display with format(), but the result is then a character vector again rather than a Date.
class(output$date)
# [1] "Date"
output2 <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y")) %>%
mutate(date = format(date, "%Y/%m/%d"))
class(output2$date)
# [1] "character"
Data
df <-
structure(
list(
film_code = c(1492L, 1492L, 1492L, 1492L, 1492L,
1492L),
cinema_code = c(304L, 352L, 489L, 429L, 524L, 71L),
total_sales = c(3900000L,
3360000L, 2560000L, 1200000L, 1200000L, 1050000L),
tickets_sold = c(26L,
42L, 32L, 12L, 15L, 7L),
tickets_out = c(0L, 0L, 0L, 0L, 0L,
0L),
show_time = c(4L, 5L, 4L, 1L, 3L, 3L),
occu_perc = c(4.26,
8.08, 20, 11.01, 16.67, 0.98),
ticket_price = c(150000L, 80000L,
80000L, 100000L, 80000L, 150000L),
ticket_use = c(26L, 42L, 32L,
12L, 15L, 7L),
capacity = c(610.3286, 519.802, 160, 108.9918,
89.982, 714.2857),
date = c("5/5/2018", "5/5/2018", "5/5/2018", "5/5/2018",
"5/5/2018", "5/5/2018"),
month = c(5L, 5L, 5L, 5L, 5L, 5L),
quarter = c(2L,
2L, 2L, 2L, 2L, 2L),
day = c(5L, 5L, 5L, 5L, 5L, 5L)
),
class = "data.frame",
row.names = c(NA,-6L)
)
My dataframe is like this :
Device_id Group Nb_burst Date_time
24 1 3 2018-09-02 10:04:04
24 1 5 2018-09-02 10:08:00
55 2 3 2018-09-03 10:14:34
55 2 7 2018-09-03 10:02:29
16 3 2 2018-09-20 08:17:11
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
I would like to know, only for the same ID, the same Group number, and the same Date, the timelag between two rows.
If the timelag is > 1 hour, keep all the rows.
If the timelag is < 1 hour, keep only the row with the biggest Nb_burst.
This would give a data frame like:
Device_id Group Nb_burst Date_time
24 1 5 2018-09-02 10:08:00
55 2 7 2018-09-03 10:02:29
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
I tried :
Data$timelag <- c(NA, difftime(Data$Min_start.time[-1], Data$Min_start.time[-nrow(Data)], units="hours"))
But I don't know how to test only when Date, ID, and Group are the same; probably with a loop.
My df has 1500 rows.
Hope someone could help me. Thank you !
I'm not sure why your group 3 is not duplicated, since time difference is greater than one hour.
But you could create two indexing variables using ave: first, the order of Nb_burst within each grouping; second, the time differences within each grouping.
dat <- within(dat, {
score <- ave(Nb_burst, Device_id, Group, as.Date(Date_time),
FUN=order)
thrsh <- abs(ave(as.numeric(Date_time), Device_id, Group, as.Date(Date_time),
FUN=diff)/3600) > 1
})
Finally subset by rowSums.
dat[rowSums(dat[c("score", "thrsh")]) > 1,1:4]
# Device_id Group Nb_burst Date_time
# 2 24 1 5 2018-09-02 10:08:00
# 3 55 2 7 2018-09-03 10:14:34
# 5 16 3 2 2018-09-20 08:17:11
# 6 16 3 71 2018-09-20 06:03:40
# 7 22 4 10 2018-10-02 11:33:55
# 8 22 4 14 2018-10-02 16:22:18
Data
dat <- structure(list(Device_id = c(24L, 24L, 55L, 55L, 16L, 16L, 22L,
22L), Group = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Nb_burst = c(3L,
5L, 7L, 3L, 2L, 71L, 10L, 14L), Date_time = structure(c(1535875444,
1535875680, 1535962474, 1535961749, 1537424231, 1537416220, 1538472835,
1538490138), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA,
-8L), class = "data.frame")
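The rule from the question can also be written out more literally. The following is a sketch under assumptions, not a replacement for the answer above: it uses the question's original rows (with UTC assumed), the names `bursts`, `key`, `span_h` and `grp_max` are made up for illustration, and a group is kept in full only when its time span is strictly greater than one hour.

```r
# The eight rows from the question (UTC assumed for the timestamps)
bursts <- data.frame(
  Device_id = c(24L, 24L, 55L, 55L, 16L, 16L, 22L, 22L),
  Group     = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  Nb_burst  = c(3L, 5L, 3L, 7L, 2L, 71L, 10L, 14L),
  Date_time = as.POSIXct(c("2018-09-02 10:04:04", "2018-09-02 10:08:00",
                           "2018-09-03 10:14:34", "2018-09-03 10:02:29",
                           "2018-09-20 08:17:11", "2018-09-20 06:03:40",
                           "2018-10-02 11:33:55", "2018-10-02 16:22:18"),
                         tz = "UTC")
)

# One key per device, group and calendar day
key <- paste(bursts$Device_id, bursts$Group,
             format(bursts$Date_time, "%Y-%m-%d"))

# Time span of each key in hours, and the largest Nb_burst of each key
span_h  <- ave(as.numeric(bursts$Date_time), key,
               FUN = function(s) diff(range(s)) / 3600)
grp_max <- ave(bursts$Nb_burst, key, FUN = max)

# Keep every row of a key spanning more than one hour; otherwise keep
# only the row(s) holding the largest Nb_burst
keep <- span_h > 1 | bursts$Nb_burst == grp_max
bursts[keep, ]
```

With these rows, both entries of group 3 are kept because they lie more than two hours apart, which matches the remark above about group 3 rather than the asker's expected output.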
The dataframe that I am working with contains entries of subscriptions with a start and stop date of the subscription. A user can have multiple rows since he/she can have or had multiple subscriptions. I would like to know if a certain subscription is followed up by another subscription.
I have considered using a for loop since the amount of observations is not particularly high (approx. 2000). However, my knowledge on this subject is not particularly high so I couldn't manage to create one. Each user has its own ID code. There are different types of subscriptions possible. I have created a dummy variable for the specific subscription of which I want to check if it is followed up on.
An example of what the data looks like:
id startdate stopdate subscriptiontype
1 2013-05-01 2013-06-01 1
2 2010-05-02 2012-05-02 3
2 2013-05-02 2013-06-02 1
2 2013-07-23 2013-12-23 2
4 2008-05-02 2011-05-02 3
4 2013-05-04 2013-06-04 1
I would like to see for each 'id' if there is another subscription with a starting date after the stopdate of subscription type 1. Would this be possible? Thank you for reading!
DATA
structure(list(id = c(1, 2, 2, 2, 4, 4), startdate = structure(c(3L,
2L, 4L, 6L, 1L, 5L), .Label = c("2008-05-02", "2010-05-02", "2013-05-01",
"2013-05-02", "2013-05-04", "2013-07-23"), class = "factor"),
stopdate = structure(c(3L, 2L, 4L, 6L, 1L, 5L), .Label = c("2011-05-02",
"2012-05-02", "2013-06-01", "2013-06-02", "2013-06-04", "2013-12-23"
), class = "factor"), subscriptiontype = c(1, 3, 1, 2, 3,
1)), class = "data.frame", row.names = c(NA, -6L))
I extended your data a little and did the following. For each id, I think you want to check whether any subscription follows a subscription of type 1. First, I converted the two date columns to class Date, just in case. Then, for each id, I ran a logical check that asks, "Is the previous value of subscriptiontype 1?" Note that this relies on the rows being in chronological order within each id.
library(dplyr)
library(lubridate)
mutate_at(mydf, vars(contains("date")),
.funs = list(~ymd(.))) %>%
group_by(id) %>%
mutate(check = lag(subscriptiontype) == 1)
id startdate stopdate subscriptiontype check
<int> <date> <date> <int> <lgl>
1 1 2013-05-01 2013-06-01 1 NA
2 2 2010-05-02 2012-05-02 3 NA
3 2 2013-05-02 2013-06-02 1 FALSE
4 2 2013-07-23 2013-12-23 2 TRUE
5 4 2008-05-02 2011-05-02 3 NA
6 4 2013-05-04 2013-06-04 1 FALSE
7 7 2018-01-01 2018-02-01 3 NA
8 7 2018-03-01 2018-03-15 1 FALSE
9 7 2018-04-01 2018-05-15 4 TRUE
DATA
mydf <- structure(list(id = c(1L, 2L, 2L, 2L, 4L, 4L, 7L, 7L, 7L), startdate = c("2013-05-01",
"2010-05-02", "2013-05-02", "2013-07-23", "2008-05-02", "2013-05-04",
"2018-01-01", "2018-03-01", "2018-04-01"), stopdate = c("2013-06-01",
"2012-05-02", "2013-06-02", "2013-12-23", "2011-05-02", "2013-06-04",
"2018-02-01", "2018-03-15", "2018-05-15"), subscriptiontype = c(1L,
3L, 1L, 2L, 3L, 1L, 3L, 1L, 4L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
id startdate stopdate subscriptiontype
1 1 2013-05-01 2013-06-01 1
2 2 2010-05-02 2012-05-02 3
3 2 2013-05-02 2013-06-02 1
4 2 2013-07-23 2013-12-23 2
5 4 2008-05-02 2011-05-02 3
6 4 2013-05-04 2013-06-04 1
7 7 2018-01-01 2018-02-01 3
8 7 2018-03-01 2018-03-15 1
9 7 2018-04-01 2018-05-15 4
You can self-join the table. You first filter the users based on whether they have subscription type "1", then join for any other subscription type. You then check whether a user has another subscription (2,3,4) which starts after the first ends. Finally, you can collapse by user using "summarise" to see whether our conditions are true.
library(dplyr)
mydf%>%
filter(subscriptiontype==1)%>%
full_join(mydf[mydf$subscriptiontype!=1,], by="id", suffix=c(".Type1",".OtherType"))%>%
mutate(check=as.Date(startdate.OtherType)>=as.Date(stopdate.Type1))%>%
group_by(id)%>%
summarise(any(check, na.rm = TRUE))
id `any(check, na.rm = TRUE)`
<dbl> <lgl>
1 FALSE
2 TRUE
4 FALSE
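The same self-join idea can be sketched in base R with `merge()`. This is an illustration under assumptions: it uses the six rows from the question, and the names `subs`, `type1`, `others`, `pairs` and `followed` are made up here. As in the dplyr version, a later subscription counts if it starts on or after the type 1 stop date.

```r
# The six subscriptions from the question, with real Date columns
subs <- data.frame(
  id = c(1, 2, 2, 2, 4, 4),
  startdate = as.Date(c("2013-05-01", "2010-05-02", "2013-05-02",
                        "2013-07-23", "2008-05-02", "2013-05-04")),
  stopdate = as.Date(c("2013-06-01", "2012-05-02", "2013-06-02",
                       "2013-12-23", "2011-05-02", "2013-06-04")),
  subscriptiontype = c(1, 3, 1, 2, 3, 1)
)

# Stop date of each type 1 subscription, and all other subscriptions
type1 <- subs[subs$subscriptiontype == 1, c("id", "stopdate")]
names(type1)[2] <- "type1_stop"
others <- subs[subs$subscriptiontype != 1, ]

# Pair every type 1 subscription with the same user's other subscriptions;
# all.x = TRUE keeps users who have no other subscription at all
pairs <- merge(type1, others, by = "id", all.x = TRUE)
pairs$later <- pairs$startdate >= pairs$type1_stop

# TRUE if at least one other subscription starts on or after the stop date
followed <- tapply(pairs$later, pairs$id,
                   function(v) any(v, na.rm = TRUE))
followed
```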
I have a data frame with thousands of ids, with several events per id, plus enrolment dates, course and record. Course is categorical: module1, module2, module3, module4, module5 and withdrawn (from any module). For example, a few rows look like this:
id event enrolment date Enrolment to course record
1 42 2012-07-01 2013-06-30 module 5 2
1 42 2018-07-01 2019-06-30 **module 4** 1
1 43 2012-07-01 2013-06-30 module 5 2
1 43 2018-07-01 2019-06-30 **module 4** 1
2 50 2017-04-01 2018-03-31 **module 5** 2
2 50 2017-07-01 2018-03-31 module 4 1
2 34 2017-04-01 2018-03-31 **module 5** 2
2 34 2017-07-01 2018-03-31 module 4 1
3 23 2014-08-20 2015-07-20 module 5 1
3 23 2014-08-20 2015-07-20 module 4 2
3 23 2015-07-04 2016-06-04 **withdrawn** 3
4 13 2017-09-01 2018-08-01 module 4 1
4 13 2017-09-01 2018-08-01 **module 5** 2
4 23 2017-09-01 2018-08-01 module 4 1
4 23 2017-09-01 2018-08-01 **module 5** 2
I would like to retain the 2nd, 4th, 5th, 7th, 11th, 13th and 15th rows of the data frame (education).
I tried factoring course which wrongly assigns module 5 for events 42 & 43 and if I go by max date then it wrongly assigns module 4 to events 50 & 34
I would like data to look like below
id event status_date Course record
1 42 2018-07-01 module 4 1
1 43 2018-07-01 module 4 1
2 50 2017-04-01 module 5 2
2 34 2017-04-01 module 5 2
3 23 2015-07-04 withdrawn 3
4 13 2017-09-01 module 5 2
4 23 2017-09-01 module 5 2
If I have understood all the requirements correctly, here is a function which selects the correct row in each group:
library(dplyr)
select_dates <- function(start, end, course) {
  # If all start dates are the same, return the row with "module5"
  if (n_distinct(start) == 1)
    which.max(course == "module5")
  else {
    # Flag the courses that are currently enrolled
    inds <- max(start) < end
    # If any enrolled course is "module5" and none is "withdrawn"
    if (any(course[inds] == "module5") && all(course[inds] != "withdrawn"))
      # return the currently enrolled "module5" row
      which.max(inds & course == "module5")
    else
      # return the currently enrolled row with the latest start date
      which.max(start == max(start[inds]))
  }
}
We then apply it for each id and event
df %>%
mutate_at(vars(enrolment_date, Enrolment_to), as.Date) %>%
group_by(id, event) %>%
slice(select_dates(enrolment_date, Enrolment_to, course))
# id event enrolment_date Enrolment_to course record
# <int> <int> <date> <date> <chr> <int>
#1 1 42 2018-07-01 2019-06-30 module4 1
#2 1 43 2018-07-01 2019-06-30 module4 1
#3 2 34 2017-04-01 2018-03-31 module5 2
#4 2 50 2017-04-01 2018-03-31 module5 2
#5 3 23 2015-07-04 2016-06-04 withdrawn 3
#6 4 13 2017-09-01 2018-08-01 module5 2
#7 4 23 2017-09-01 2018-08-01 module5 2
Note that you need to change the strings in the function ("module5" and "withdrawn") and the column names (enrolment_date and Enrolment_to) based on what you have in your data.
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 4L), event = c(42L, 42L, 43L, 43L, 50L, 50L,
34L, 34L, 23L, 23L, 23L, 13L, 13L, 23L, 23L), enrolment_date = c("2012-07-01",
"2018-07-01", "2012-07-01", "2018-07-01", "2017-04-01", "2017-07-01",
"2017-04-01", "2017-07-01", "2014-08-20", "2014-08-20", "2015-07-04",
"2017-09-01", "2017-09-01", "2017-09-01", "2017-09-01"), Enrolment_to = c("2013-06-30",
"2019-06-30", "2013-06-30", "2019-06-30", "2018-03-31", "2018-03-31",
"2018-03-31", "2018-03-31", "2015-07-20", "2015-07-20", "2016-06-04",
"2018-08-01", "2018-08-01", "2018-08-01", "2018-08-01"), course = c("module5",
"module4", "module5", "module4", "module5", "module4", "module5",
"module4", "module5", "module4", "withdrawn", "module4", "module5",
"module4", "module5"), record = c(2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 3L, 1L, 2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -15L))
I am dealing with a dataset like this
Id Value Date
1 250 NA
1 250 2010-06-21
2 6 NA
2 6 2012-08-23
3 545 NA
7 3310 NA
My goal is to remove entire rows if there is an NA in Date column and ID is duplicate. The final output should look like:
Id Value Date
1 250 2010-06-21
2 6 2012-08-23
3 545 NA
7 3310 NA
df1[!(is.na(df1$Date) & (duplicated(df1$Id) | duplicated(df1$Id, fromLast = TRUE))),]
# Id Value Date
#2 1 250 2010-06-21
#4 2 6 2012-08-23
#5 3 545 <NA>
#6 7 3310 <NA>
DATA
df1 = structure(list(Id = c(1L, 1L, 2L, 2L, 3L, 7L), Value = c(250L,
250L, 6L, 6L, 545L, 3310L), Date = c(NA, "2010-06-21", NA, "2012-08-23",
NA, NA)), .Names = c("Id", "Value", "Date"), class = "data.frame", row.names = c(NA,
-6L))
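The condition can also be split into named steps, which makes the rule easier to read. This is a sketch with made-up helper names (`n_per_id`, `drop_row`, `result`), using the same data as above: a row is dropped only when its Date is NA and its Id occurs more than once.

```r
# Same data as above
df1 <- data.frame(
  Id    = c(1L, 1L, 2L, 2L, 3L, 7L),
  Value = c(250L, 250L, 6L, 6L, 545L, 3310L),
  Date  = c(NA, "2010-06-21", NA, "2012-08-23", NA, NA),
  stringsAsFactors = FALSE
)

# Number of rows sharing each Id
n_per_id <- ave(seq_along(df1$Id), df1$Id, FUN = length)

# Drop a row only if its Date is missing AND its Id is duplicated
drop_row <- is.na(df1$Date) & n_per_id > 1
result <- df1[!drop_row, ]
result
```

Ids 3 and 7 keep their NA rows because each occurs only once.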