My dataframe is like this :
Device_id Group Nb_burst Date_time
24 1 3 2018-09-02 10:04:04
24 1 5 2018-09-02 10:08:00
55 2 3 2018-09-03 10:14:34
55 2 7 2018-09-03 10:02:29
16 3 2 2018-09-20 08:17:11
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
For rows sharing the same ID, the same Group number, and the same Date, I would like to compute the time lag between two rows.
If the time lag is greater than 1 hour, then keep both rows.
If the time lag is less than 1 hour, then keep only the row with the biggest Nb_burst.
Which means a dataframe like:
Device_id Group Nb_burst Date_time
24 1 5 2018-09-02 10:08:00
55 2 7 2018-09-03 10:02:29
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
I tried:
Data$timelag <- c(NA, difftime(Data$Min_start.time[-1], Data$Min_start.time[-nrow(Data)], units="hours"))
But I don't know how to apply the test only when Date, ID, and Group are the same; probably a loop is needed.
My df has 1500 rows.
Hope someone can help me. Thank you!
I'm not sure why your group 3 is not duplicated, since time difference is greater than one hour.
But you could create two indexing variables using ave: first, the order of Nb_burst within each grouping; second, the time differences within each grouping.
dat <- within(dat, {
score <- ave(Nb_burst, Device_id, Group, as.Date(Date_time),
FUN=order)
thrsh <- abs(ave(as.numeric(Date_time), Device_id, Group, as.Date(Date_time),
FUN=diff)/3600) > 1
})
Finally, subset using rowSums.
dat[rowSums(dat[c("score", "thrsh")]) > 1,1:4]
# Device_id Group Nb_burst Date_time
# 2 24 1 5 2018-09-02 10:08:00
# 3 55 2 7 2018-09-03 10:14:34
# 5 16 3 2 2018-09-20 08:17:11
# 6 16 3 71 2018-09-20 06:03:40
# 7 22 4 10 2018-10-02 11:33:55
# 8 22 4 14 2018-10-02 16:22:18
Data
dat <- structure(list(Device_id = c(24L, 24L, 55L, 55L, 16L, 16L, 22L,
22L), Group = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Nb_burst = c(3L,
5L, 7L, 3L, 2L, 71L, 10L, 14L), Date_time = structure(c(1535875444,
1535875680, 1535962474, 1535961749, 1537424231, 1537416220, 1538472835,
1538490138), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA,
-8L), class = "data.frame")
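If you prefer dplyr, here is a rough sketch of the same filtering logic. It assumes, as in this data, at most two rows per Device_id/Group/day; groups with more rows would need pairwise gap logic instead of a single min/max span.

```r
library(dplyr)

# the question's data, rebuilt from the timestamps shown above
dat <- data.frame(
  Device_id = c(24, 24, 55, 55, 16, 16, 22, 22),
  Group     = c(1, 1, 2, 2, 3, 3, 4, 4),
  Nb_burst  = c(3, 5, 7, 3, 2, 71, 10, 14),
  Date_time = as.POSIXct(c("2018-09-02 10:04:04", "2018-09-02 10:08:00",
                           "2018-09-03 10:14:34", "2018-09-03 10:02:29",
                           "2018-09-20 08:17:11", "2018-09-20 06:03:40",
                           "2018-10-02 11:33:55", "2018-10-02 16:22:18")))

res <- dat %>%
  group_by(Device_id, Group, day = as.Date(Date_time)) %>%
  # span > 1 hour: keep every row; otherwise keep only the largest Nb_burst
  filter(as.numeric(difftime(max(Date_time), min(Date_time),
                             units = "hours")) > 1 |
           Nb_burst == max(Nb_burst)) %>%
  ungroup() %>%
  select(-day)
```

This should keep the same six rows as the rowSums subset above (the smaller burst is dropped only for devices 24 and 55, whose pairs are under an hour apart).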
Given a time series containing data about cinemas, the "date" column is of interest. I would like to convert it into the format "YYYY/MM/DD". However, when I run my code:
CINEMA.TICKET$DATE <- as.Date(CINEMA.TICKET$date , format = "%y/%m/%d")
Two issues occur:
First, the dates are shown on the far right of the table as, e.g., "0005-05-20". Second, many entries disappear entirely. Can someone explain what I am doing wrong, and how can I do it properly?
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day newdate DATE
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 5/5/2018 5 2 5 0005-05-20 2005-05-20
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 5/5/2018 5 2 5 0005-05-20 2005-05-20
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 5/5/2018 5 2 5 0005-05-20 2005-05-20
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 5/5/2018 5 2 5 0005-05-20 2005-05-20
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 5/5/2018 5 2 5 0005-05-20 2005-05-20
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 5/5/2018 5 2 5 0005-05-20 2005-05-20
> str(CINEMA.TICKET)
As #Dave2e pointed out, you are looking for:
CINEMA.TICKET[, date := as.Date(date , format = "%d/%m/%Y")]
assuming the input is day-first (e.g. "30/5/2018"); the question is not clear on this, since its example "5/5/2018" could be either "%d/%m/%Y" or "%m/%d/%Y"
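To see why the posted "%y/%m/%d" format misbehaves, a quick sketch: the lowercase %y expects a two-digit year, so "5/5/2018" is read as year 05, month 5, day 20, with the trailing "18" silently ignored; and any string that cannot match the format at all becomes NA, which is why entries disappear.

```r
# %y takes "5" as the two-digit year (-> 2005), then month 5, day 20;
# the leftover "18" is silently dropped
as.Date("5/5/2018", format = "%y/%m/%d")
#> [1] "2005-05-20"

# a date like "5/13/2018" cannot match (month 13), so it becomes NA --
# this is why some entries "disappear"
as.Date("5/13/2018", format = "%y/%m/%d")
#> [1] NA

# month-first input parsed with a four-digit %Y in the right position
as.Date("5/5/2018", format = "%m/%d/%Y")
#> [1] "2018-05-05"
```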
As for ordering columns use:
setcolorder(CINEMA.TICKET, c("c", "b", "a"))
where c, b, a are the column names in their desired order
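A minimal, self-contained illustration of the mechanics (column names invented for the example):

```r
library(data.table)

# toy table just to show the reordering
DT <- data.table(a = 1:2, b = 3:4, c = 5:6)
setcolorder(DT, c("c", "b", "a"))  # reorders by reference, no copy
names(DT)
#> [1] "c" "b" "a"
```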
lubridate probably does the trick
> lubridate::mdy("5/5/2018")
[1] "2018-05-05"
So you should use
library(lubridate)
library(tidyverse)
CINEMA.TICKET <- CINEMA.TICKET %>%
mutate(DATE=mdy(date))
Here is another option:
library(tidyverse)
output <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y"))
Output
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 2018-05-05 5 2 5
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 2018-05-05 5 2 5
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 2018-05-05 5 2 5
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 2018-05-05 5 2 5
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 2018-05-05 5 2 5
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 2018-05-05 5 2 5
To keep date classified as a Date, you cannot display it with forward slashes. You can reformat it to "YYYY/MM/DD", but it will then no longer be classified as Date; it will be classified as character again.
class(output$date)
# [1] "Date"
output2 <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y")) %>%
mutate(date = format(date, "%Y/%m/%d"))
class(output2$date)
# [1] "character"
Data
df <-
structure(
list(
film_code = c(1492L, 1492L, 1492L, 1492L, 1492L,
1492L),
cinema_code = c(304L, 352L, 489L, 429L, 524L, 71L),
total_sales = c(3900000L,
3360000L, 2560000L, 1200000L, 1200000L, 1050000L),
tickets_sold = c(26L,
42L, 32L, 12L, 15L, 7L),
tickets_out = c(0L, 0L, 0L, 0L, 0L,
0L),
show_time = c(4L, 5L, 4L, 1L, 3L, 3L),
occu_perc = c(4.26,
8.08, 20, 11.01, 16.67, 0.98),
ticket_price = c(150000L, 80000L,
80000L, 100000L, 80000L, 150000L),
ticket_use = c(26L, 42L, 32L,
12L, 15L, 7L),
capacity = c(610.3286, 519.802, 160, 108.9918,
89.982, 714.2857),
date = c("5/5/2018", "5/5/2018", "5/5/2018", "5/5/2018",
"5/5/2018", "5/5/2018"),
month = c(5L, 5L, 5L, 5L, 5L, 5L),
quarter = c(2L,
2L, 2L, 2L, 2L, 2L),
day = c(5L, 5L, 5L, 5L, 5L, 5L)
),
class = "data.frame",
row.names = c(NA,-6L)
)
Suppose I have two datasets. One main dataset, with many columns of metadata, and one new dataset which will be used to fill in some of the gaps in concentrations in the main dataset:
Main dataset:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 NA NA
1 4 22 0 NA NA
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 NA NA
2 4 37 3 NA NA
New data set to merge:
study_id timepoint concentration1 concentration2
1 3 11 20
1 4 21 35
2 3 7 17
2 4 14 25
Whenever I merge by "study_id" and "timepoint", I get two new columns that are "concentration1.y" and "concentration2.y" while the original columns get renamed as "concentration1.x" and "concentration2.x". I don't want this.
This is what I want:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 11 20
1 4 22 0 21 35
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 7 17
2 4 37 3 14 25
In other words, I want to merge by "study_id" and "timepoint" AND merge the two concentration columns so the data end up within the same columns. Please note that the two datasets do not have identical columns (dataset 1 has 1000 columns of metadata, while dataset 2 has only study_id, timepoint, and concentration columns matching those in dataset 1).
Thanks so much in advance.
Using coalesce (from the dplyr package) is one option. This still adds the two concentration columns from the second data frame; they are removed after the NAs are filled in.
library(tidyverse)
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
mutate(concentration1 = coalesce(concentration1.x, concentration1.y),
concentration2 = coalesce(concentration2.x, concentration2.y)) %>%
select(-concentration1.x, -concentration1.y, -concentration2.x, -concentration2.y)
Or to generalize with multiple concentration columns:
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y")) %>%
map_df(reduce, coalesce)
Edit: To prevent the resultant column names from being alphabetized from split.default, you can add an intermediate step of sorting the list based on the first data frame's column name order.
df3 <- df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y"))
df3[names(df1)] %>%
map_df(reduce, coalesce)
Output
study_id timepoint age occupation concentration1 concentration2
1 1 1 21 0 3 7
2 1 2 21 0 4 6
3 1 3 22 0 11 20
4 1 4 22 0 21 35
5 2 1 36 3 0 4
6 2 2 36 3 2 11
7 2 3 37 3 7 17
8 2 4 37 3 14 25
Data
df1 <- structure(list(study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), age = c(21L,
21L, 22L, 22L, 36L, 36L, 37L, 37L), occupation = c(0L, 0L,
0L, 0L, 3L, 3L, 3L, 3L), concentration1 = c(3L, 4L, NA, NA,
0L, 2L, NA, NA), concentration2 = c(7L, 6L, NA, NA, 4L, 11L,
NA, NA)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(study_id = c(1L, 1L, 2L, 2L), timepoint = c(3L,
4L, 3L, 4L), concentration1 = c(11L, 21L, 7L, 14L), concentration2 = c(20L,
35L, 17L, 25L)), class = "data.frame", row.names = c(NA, -4L))
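As an aside: if your dplyr version is 1.0.0 or newer, rows_patch() was designed for exactly this pattern. It fills only the NA cells of the first data frame from matching rows of the second, keyed by the join columns, so no .x/.y columns appear at all. A sketch using the same df1/df2:

```r
library(dplyr)  # rows_patch() needs dplyr >= 1.0.0

df1 <- data.frame(study_id = c(1, 1, 1, 1, 2, 2, 2, 2),
                  timepoint = c(1, 2, 3, 4, 1, 2, 3, 4),
                  age = c(21, 21, 22, 22, 36, 36, 37, 37),
                  occupation = c(0, 0, 0, 0, 3, 3, 3, 3),
                  concentration1 = c(3, 4, NA, NA, 0, 2, NA, NA),
                  concentration2 = c(7, 6, NA, NA, 4, 11, NA, NA))
df2 <- data.frame(study_id = c(1, 1, 2, 2),
                  timepoint = c(3, 4, 3, 4),
                  concentration1 = c(11, 21, 7, 14),
                  concentration2 = c(20, 35, 17, 25))

# fill only the NA cells of df1 from the matching rows of df2
res <- rows_patch(df1, df2, by = c("study_id", "timepoint"))
res$concentration1
#> [1]  3  4 11 21  0  2  7 14
```

Note that rows_patch() expects every key in df2 to exist in df1, which holds here.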
I have a data frame with thousands of ids, with several events per id, plus enrollment dates, course, and record. Course is categorical: module1, module2, module3, module4, module5, and withdrawn (any module). For example, a few rows look like below
id event enrolment date Enrolment to course record
1 42 2012-07-01 2013-06-30 module 5 2
1 42 2018-07-01 2019-06-30 **module 4** 1
1 43 2012-07-01 2013-06-30 module 5 2
1 43 2018-07-01 2019-06-30 **module 4** 1
2 50 2017-04-01 2018-03-31 **module 5** 2
2 50 2017-07-01 2018-03-31 module 4 1
2 34 2017-04-01 2018-03-31 **module 5** 2
2 34 2017-07-01 2018-03-31 module 4 1
3 23 2014-08-20 2015-07-20 module 5 1
3 23 2014-08-20 2015-07-20 module 4 2
3 23 2015-07-04 2016-06-04 **withdrawn** 3
4 13 2017-09-01 2018-08-01 module 4 1
4 13 2017-09-01 2018-08-01 **module 5** 2
4 23 2017-09-01 2018-08-01 module 4 1
4 23 2017-09-01 2018-08-01 **module 5** 2
I would like to retain the 2nd, 4th, 5th, 7th, 11th, 13th, and 15th rows in the data frame (education).
I tried factoring course, but that wrongly assigns module 5 to events 42 & 43; and if I go by max date, it wrongly assigns module 4 to events 50 & 34.
I would like data to look like below
id event status_date Course record
1 42 2018-07-01 module 4 1
1 43 2018-07-01 module 4 1
2 50 2017-04-01 module 5 2
2 34 2016-04-01 module 5 2
3 23 2015-07-04 withdrawn 3
4 13 2017-09-01 module 5 2
4 23 2017-09-01 module 5 2
If I have understood all the requirements correctly, here is a function which selects the correct date in each group
library(dplyr)
select_dates <- function(start, end, course) {
#If there is same date return course with "module5"
if (n_distinct(start) == 1)
which.max(course == "module5")
else {
#Get courses which are currently enrolled
inds <- max(start) < end
#If any course has "module5" and no "withdrawn"
if (any(course[inds] == "module5") & all(course[inds] != "withdrawn"))
#return the course with "module5" which is currently enrolled
which.max(inds & course == "module5")
else
#return the currently enrolled course with a max date
which.max(start == max(start[inds]))
}
}
We then apply it for each id and event
df %>%
mutate_at(vars(enrolment_date, Enrolment_to), as.Date) %>%
group_by(id, event) %>%
slice(select_dates(enrolment_date, Enrolment_to, course))
# id event enrolment_date Enrolment_to course record
# <int> <int> <date> <date> <chr> <int>
#1 1 42 2018-07-01 2019-06-30 module4 1
#2 1 43 2018-07-01 2019-06-30 module4 1
#3 2 34 2017-04-01 2018-03-31 module5 2
#4 2 50 2017-04-01 2018-03-31 module5 2
#5 3 23 2015-07-04 2016-06-04 withdrawn 3
#6 4 13 2017-09-01 2018-08-01 module5 2
#7 4 23 2017-09-01 2018-08-01 module5 2
Note that you need to change the strings in the function ("module5" and "withdrawn") and the column names (enrolment_date and Enrolment_to) based on what you have in your data.
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 4L), event = c(42L, 42L, 43L, 43L, 50L, 50L,
34L, 34L, 23L, 23L, 23L, 13L, 13L, 23L, 23L), enrolment_date = c("2012-07-01",
"2018-07-01", "2012-07-01", "2018-07-01", "2017-04-01", "2017-07-01",
"2017-04-01", "2017-07-01", "2014-08-20", "2014-08-20", "2015-07-04",
"2017-09-01", "2017-09-01", "2017-09-01", "2017-09-01"), Enrolment_to = c("2013-06-30",
"2019-06-30", "2013-06-30", "2019-06-30", "2018-03-31", "2018-03-31",
"2018-03-31", "2018-03-31", "2015-07-20", "2015-07-20", "2016-06-04",
"2018-08-01", "2018-08-01", "2018-08-01", "2018-08-01"), course = c("module5",
"module4", "module5", "module4", "module5", "module4", "module5",
"module4", "module5", "module4", "withdrawn", "module4", "module5",
"module4", "module5"), record = c(2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 3L, 1L, 2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -15L))
I struggle a bit with the following problem:
I have table A (below), and I would like to convert the intervals defined there into individual positions, as in table B, by summing the values from table A wherever intervals overlap a position. If only one interval covers a position, just use its value; if no interval covers it, use 0. I would prefer a solution in R. I would really appreciate your help.
Table A
ID Start End Value
1 1 5 9
2 3 7 5
3 5 9 13
4 11 15 1
5 12 16 18
6 14 18 21
Convert to this Table B
Position Value
1 9
2 9
3 14
4 14
5 27
6 18
7 18
8 13
9 13
10 0
11 15
12 33
13 33
14 54
15 54
16 39
17 21
18 21
Not a very straightforward way, but it gets the job done:
df<-structure(list(ID = 1:6, Start = c(1L, 3L, 5L, 11L, 12L, 14L),
End = c(5L, 7L, 9L, 15L, 16L, 18L),
Value = c(9L, 5L, 13L, 1L, 18L, 21L)), .Names = c("ID", "Start", "End", "Value"),
class = "data.frame", row.names = c(NA,
-6L))
# create list matrix for each grouping
s1 <- lapply(seq_len(nrow(df)), function(i) {matrix(c(df[i,2]:df[i,3], rep(df[i,4], df[i,3]-df[i,2]+1)), nrow = df[i,3]-df[i,2]+1)})
s2<-as.data.frame(do.call(rbind, s1))
#sum all of the like positions
library(dplyr)
wgaps<-summarise(group_by(s2, V1), sum(V2))
#create sequence with no gaps in it and match
nogaps<-data.frame(Position=seq(min(wgaps$V1), max(wgaps$V1)))
nogaps<-left_join(nogaps, wgaps, by=c("Position"= "V1"))
names(nogaps)<-c("Position", "value") #rename
nogaps$value[is.na(nogaps$value)]<-0 #replace NA (uncovered positions) with 0
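A more compact sketch of the same idea, without building the intermediate matrix: for every position in the overall range, sum the Value of each interval that covers it (positions covered by nothing get 0).

```r
df <- data.frame(ID = 1:6,
                 Start = c(1, 3, 5, 11, 12, 14),
                 End   = c(5, 7, 9, 15, 16, 18),
                 Value = c(9, 5, 13, 1, 18, 21))

# one value per position in the full Start..End range
pos <- seq(min(df$Start), max(df$End))
val <- vapply(pos,
              function(p) sum(df$Value[df$Start <= p & df$End >= p]),
              numeric(1))
tableB <- data.frame(Position = pos, Value = val)
```

For example, position 5 is covered by the first three intervals, so its value is 9 + 5 + 13 = 27, and position 10 is covered by none, so it gets 0.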
I’m looking for a way to employ a lookup algorithm on a dataframe that, for a given element, examines corresponding variables within a range and returns the max such variable. The general gist is that I want the function to (1) look at a given element, (2) find all other elements of the same name, (3) among all elements of the same name, see if a corresponding variable is within +- X of any others, and (4) if so, return the max of those; if not, just return whatever that variable is.
A concrete example is with some time stamp data. Say I have orders for 2 businesses that are classified by date, hour, and minute. I want to look at daily orders, but the problem is that if orders come within 2 minutes of each other, they’re double-counted, so I only want to look at the max value in such cases.
*EDIT: I should clarify that if orders are logged consecutively within a couple of minutes of each other, we assume they are duplicates and only want the max value. So if 4 orders came in, each a minute apart, and there were no other orders within +2 minutes of the last or -2 minutes of the first, that group of 4 should be counted only once, using its max value.
Here's some data:
data <- structure(list(date = structure(c(16090, 16090, 16090, 16090,
16090, 16090, 16090, 16090, 16090, 16090, 16090, 16090, 16091,
16091, 16091, 16091, 16091, 16091, 16091), class = "Date"), company = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("ABCo", "Zyco"), class = "factor"), hour = c(5L,
5L, 5L, 7L, 7L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 6L, 6L, 6L, 7L, 7L,
7L, 8L), minute = c(21L, 22L, 50L, 13L, 20L, 34L, 47L, 34L, 35L,
20L, 44L, 19L, 14L, 16L, 37L, 24L, 26L, 49L, 50L), orders = c(59L,
46L, 31L, 15L, 86L, 23L, 8L, 71L, 86L, 44L, 23L, 47L, 6L, 53L,
21L, 54L, 73L, 63L, 4L)), .Names = c("date", "company", "hour",
"minute", "orders"), row.names = c(NA, -19L), class = "data.frame")
What I care about here is: for each company, on a given date, within a given hour, if any entries fall within +-2 minutes of each other, I want to take the max value of "orders". If a given entry has nothing within +-2 minutes of it, then just keep its "orders" value as given. (In this case, for the first two rows of "data", ABCo on 2014-01-20 at hour = 5: since minutes 21 and 22 are within +-2 of each other, we'd return the max value for orders, 59. The third row, ABCo on 1-20 at hour = 5 and minute = 50, has no other minute within +-2 of it, so we'd just keep its orders value, 31.)
A starting point to look at the data for minutes and orders in terms of company+date+hour could be to concatenate these 3 terms together and reorganize the data frame:
data$biztime <- do.call(paste, c(data[c("company","date","hour")], sep = "_"))
data2 <- ddply(data, .(biztime, minute), summarise, orders = sum(orders))
But from here I'm lost. Is there any easy way to add another column to this dataframe using an ifelse statement or something else along these lines that does the sort of conditional operation above?
Add a column of datetime objects:
data <- transform(data,
datetime = strptime(sprintf("%s %s:%s", date, hour, minute),
format = "%Y-%m-%d %H:%M"))
Add a column of indices where two rows within two minutes of each other will share the same index:
data <- ddply(data, .(company), transform, timegroup =
cumsum(c(TRUE, diff(datetime, units = "mins") > 2)))
Finally, summarize:
ddply(data, .(company, timegroup), summarise,
orders = max(orders),
datetime = datetime[1])
# company timegroup orders datetime
# 1 ABCo 1 59 2014-01-20 05:21:00
# 2 ABCo 2 31 2014-01-20 05:50:00
# 3 ABCo 3 15 2014-01-20 07:13:00
# 4 ABCo 4 86 2014-01-20 07:20:00
# 5 ABCo 5 53 2014-01-21 06:14:00
# 6 ABCo 6 21 2014-01-21 06:37:00
# 7 ABCo 7 73 2014-01-21 07:24:00
# 8 ABCo 8 63 2014-01-21 07:49:00
# 9 ABCo 9 4 2014-01-21 08:50:00
# 10 Zyco 1 23 2014-01-20 05:34:00
# 11 Zyco 2 8 2014-01-20 05:47:00
# 12 Zyco 3 86 2014-01-20 06:34:00
# 13 Zyco 4 44 2014-01-20 07:20:00
# 14 Zyco 5 23 2014-01-20 07:44:00
# 15 Zyco 6 47 2014-01-20 08:19:00
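The same timegroup idea translates to dplyr if you prefer. This is a sketch (it assumes a reasonably recent dplyr for summarise(.groups =)); note the explicit as.numeric(..., units = "mins") so the two-minute comparison is unambiguous regardless of which units diff() picks:

```r
library(dplyr)

# the question's data, rebuilt compactly
data <- data.frame(
  date    = as.Date(rep(c("2014-01-20", "2014-01-21"), c(12, 7))),
  company = c(rep("ABCo", 5), rep("Zyco", 7), rep("ABCo", 7)),
  hour    = c(5, 5, 5, 7, 7, 5, 5, 6, 6, 7, 7, 8, 6, 6, 6, 7, 7, 7, 8),
  minute  = c(21, 22, 50, 13, 20, 34, 47, 34, 35, 20, 44, 19,
              14, 16, 37, 24, 26, 49, 50),
  orders  = c(59, 46, 31, 15, 86, 23, 8, 71, 86, 44, 23, 47,
              6, 53, 21, 54, 73, 63, 4))

result <- data %>%
  mutate(datetime = as.POSIXct(sprintf("%s %02d:%02d:00", date,
                                       as.integer(hour),
                                       as.integer(minute)))) %>%
  arrange(company, datetime) %>%
  group_by(company) %>%
  # start a new group whenever the gap to the previous row exceeds 2 minutes
  mutate(timegroup = cumsum(c(TRUE,
           as.numeric(diff(datetime), units = "mins") > 2))) %>%
  group_by(company, timegroup) %>%
  summarise(orders = max(orders), datetime = first(datetime),
            .groups = "drop")
```

This should reproduce the 15 company/timegroup rows shown above.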
Unless I misunderstood something, perhaps this is helpful; it is probably slow, though.
data$gr = as.numeric(interaction(data$company, data$date, data$hour))
ff = function(mins, ords) {
unlist(lapply(mins, function(x) max(ords[abs(x - mins) <= 2])))
}
do.call(rbind,
lapply(split(data, data$gr),
function(x) transform(x, new_val = ff(x$minute, x$orders))))
# date company hour minute orders gr new_val
#1.1 2014-01-20 ABCo 5 21 59 1 59
#1.2 2014-01-20 ABCo 5 22 46 1 59
#1.3 2014-01-20 ABCo 5 50 31 1 31
#2.6 2014-01-20 Zyco 5 34 23 2 23
#2.7 2014-01-20 Zyco 5 47 8 2 8
#6.8 2014-01-20 Zyco 6 34 71 6 86
#6.9 2014-01-20 Zyco 6 35 86 6 86
#7.13 2014-01-21 ABCo 6 14 6 7 53
#7.14 2014-01-21 ABCo 6 16 53 7 53
#7.15 2014-01-21 ABCo 6 37 21 7 21
#9.4 2014-01-20 ABCo 7 13 15 9 15
#9.5 2014-01-20 ABCo 7 20 86 9 86
#10.10 2014-01-20 Zyco 7 20 44 10 44
#10.11 2014-01-20 Zyco 7 44 23 10 23
#11.16 2014-01-21 ABCo 7 24 54 11 73
#11.17 2014-01-21 ABCo 7 26 73 11 73
#11.18 2014-01-21 ABCo 7 49 63 11 63
#14 2014-01-20 Zyco 8 19 47 14 47
#15 2014-01-21 ABCo 8 50 4 15 4