I want to pick out the rows whose time falls within any of several intervals.
The data frame is like this:
dput output:
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B"), score_time = c("2022/09/01 9:00:00", "2022/09/02 18:00:00",
"2022/09/03 12:00:00", NA, NA, "2022/09/15 18:00:00", "2022/09/18 20:00:00",
NA, NA, NA), score = c(243, 232, 319, NA, NA, 436, 310, NA, NA,
NA), treatment_start = c(NA, NA, NA, "2022/09/02 8:00:00", "2022/09/03 11:00:00",
NA, NA, "2022/09/15 8:00:00", "2022/09/16 14:00:00", "2022/09/16 23:00:00"
), treatment_end = c(NA, NA, NA, "2022/09/02 22:00:00", "2022/09/09 12:00:00",
NA, NA, "2022/09/16 2:00:00", "2022/09/16 22:00:00", "2022/09/17 0:00:00"
)), row.names = c(NA, -10L), spec = structure(list(cols = list(
ID = structure(list(), class = c("collector_character", "collector"
)), score_time = structure(list(), class = c("collector_character",
"collector")), score = structure(list(), class = c("collector_double",
"collector")), treatment_start = structure(list(), class = c("collector_character",
"collector")), treatment_end = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x6000000190b0>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
ID score_time score treatment_start treatment_end
<chr> <chr> <dbl> <chr> <chr>
1 A 2022/09/01 9:00:00 243 NA NA
2 A 2022/09/02 18:00:00 232 NA NA
3 A 2022/09/03 12:00:00 319 NA NA
4 A NA NA 2022/09/02 8:00:00 2022/09/02 22:00:00
5 A NA NA 2022/09/03 11:00:00 2022/09/09 12:00:00
6 B 2022/09/15 18:00:00 436 NA NA
7 B 2022/09/18 20:00:00 310 NA NA
8 B NA NA 2022/09/15 8:00:00 2022/09/16 2:00:00
9 B NA NA 2022/09/16 14:00:00 2022/09/16 22:00:00
10 B NA NA 2022/09/16 23:00:00 2022/09/17 0:00:00
Multiple score values are given for each ID, together with the measurement time (score_time).
Each ID also has more than one treatment period, given by a start and an end time.
My goal is to extract the score values that were measured during a treatment period.
I tried to build intervals with lubridate and the tidyverse, but I could not find a way to apply something like "%in%" to multiple intervals.
Here is my attempt, which gets as far as putting the intervals on the same rows as the score values.
data %>%
  mutate(trt_interval = interval(start = treatment_start, end = treatment_end)) %>%
  group_by(ID) %>%
  mutate(num = row_number()) %>%
  pivot_wider(names_from = num, names_prefix = "intvl", values_from = trt_interval) %>%
  fill(c(intvl1:last_col()), .direction = "up")
The desired output looks like this.
(The first score of A and the last score of B are dropped because their score_time falls outside every treatment interval.)
ID score
<chr> <dbl>
1 A 232
2 A 319
3 B 436
I want to know a smarter way to line the data up in one row and how to apply something like "%in%" across multiple intervals.
Sorry that the question is not well formed and involves multiple steps, but any advice would be a great help.
Hi, I would first create two separate data frames, one for the scores and one for the intervals. Then I would join them and keep the scores that fall within a treatment interval.
data_score <- data %>%
  filter(!is.na(score_time)) %>%
  select(-starts_with("treat")) %>%
  mutate(score_time = ymd_hms(score_time))
data_score

data_interval <- data %>%
  filter(is.na(score_time)) %>%
  select(ID, starts_with("treat")) %>%
  mutate(across(starts_with("treat"), ymd_hms),
         trt_interval = interval(start = treatment_start, end = treatment_end))
data_score %>%
  inner_join(
    data_interval,
    by = "ID"
  ) %>%
  filter(
    score_time %within% trt_interval
  )
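To match the two-column output from the question exactly, you can finish the same pipe with a select; with the sample data this keeps exactly the three scores you want:
data_score %>%
  inner_join(data_interval, by = "ID") %>%
  filter(score_time %within% trt_interval) %>%
  select(ID, score)
# ID    score
# A       232
# A       319
# B       436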
Hope this helps!!
Related
I need a chart of accounts to stay in order when new accounts are added or dropped in future years. In accounting, accounts are sorted by type (for example Asset, Liability, Equity), but that ordering is not explicit in the dataset. This is an example of the code that is putting the new accounts from Year2 and Year3 at the bottom.
XYZCompany_Consolidated <- XYZCompany_Year1 %>%
  full_join(XYZCompany_Year2, by = "Account") %>%
  full_join(XYZCompany_Year3, by = "Account")
Example: this picture is just a simplified example. The orange highlight is where the new accounts end up, to the right is the code I'm using, and the green is what I'm trying to achieve.
Perhaps I'm overthinking this problem but I find it hard to solve. Let's define some data first:
df_year1 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"JaneDoe"), Year_1 = c(100, 1000, 20, 300, 500)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_1 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year2 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Laptop", "JaneDoe"), Year_2 = c(80, 1200, 50, 300, 500, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year3 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Rent", "JaneDoe"), Year_3 = c(80, 1200, 50, 300, 1000, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_3 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Those are similar to the data shown in the OP's picture, e.g. df_year1 looks like
# A tibble: 5 x 2
Account Year_1
<chr> <dbl>
1 Cash 100
2 Accounts 1000
3 Loan1 20
4 Auto 300
5 JaneDoe 500
Next we transform those data a little bit, namely
library(dplyr)
library(tidyr)
df_y1 <- df_year1 %>%
  mutate(Year = 1,
         no = row_number()) %>%
  rename(value = Year_1)
which returns
# A tibble: 5 x 4
Account value Year no
<chr> <dbl> <dbl> <int>
1 Cash 100 1 1
2 Accounts 1000 1 2
3 Loan1 20 1 3
4 Auto 300 1 4
5 JaneDoe 500 1 5
The new column no stores the account's original position, column Year stores the chart's year. All three data.frames are processed like this, so we get df_y1, df_y2, df_y3.
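For completeness, the other two charts can be prepared the same way (a sketch by analogy with df_y1 above):
df_y2 <- df_year2 %>%
  mutate(Year = 2,
         no = row_number()) %>%
  rename(value = Year_2)

df_y3 <- df_year3 %>%
  mutate(Year = 3,
         no = row_number()) %>%
  rename(value = Year_3)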
Finally we bind them together
bind_rows(df_y1, df_y2, df_y3) %>%
  mutate(num_years = max(Year)) %>%
  group_by(Account) %>%
  mutate(rank = sum((num_years - n() + 1) * no), .keep = "unused") %>%
  pivot_wider(names_from = Year, names_prefix = "Year_") %>%
  arrange(rank) %>%
  select(-rank) %>%
  ungroup()
and calculate a rank for each account. The accounts are ordered by this rank. As a result, we get
# A tibble: 7 x 4
Account Year_1 Year_2 Year_3
<chr> <dbl> <dbl> <dbl>
1 Cash 100 80 80
2 Accounts 1000 1200 1200
3 Loan1 20 50 50
4 Auto 300 300 300
5 Laptop NA 500 NA
6 Rent NA NA 1000
7 JaneDoe 500 0 0
Note
I believe there are better approaches, but at least this works for the example data.
I'm not sure about the calculated rank's stability. Take care.
I have a dataframe with the following structure:
Timestamp Value1 Value2 Problem1 Problem2
00:00 32 40 No No
00:05 12 55 No No
00:10 14 42 Yes No
00:15 50 33 Yes No
00:20 78 47 No No
Where Problem1 indicates whether there is a problem with Value1, and Problem2 whether there is a problem with Value2. If Problem1 is Yes, I need to replace Value1 with Value2. If both have problems, they should remain unchanged.
My problem here is that I won't know in advance how many Value and Problem columns there will be. So, if there are more than 2, I need to replace each value that has a problem with the average of the values without problems.
So, in another example:
Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
00:00 32 40 45 No No No
00:05 12 55 48 No No No
00:10 14 42 55 Yes No No
00:15 50 33 13 Yes No Yes
00:20 78 47 75 No No No
Here I'd need to replace Value1 at 00:10 by the average of Value2 and Value3. Also, I'd need to replace Value1 and Value3 at 00:15 by Value2.
I bet there is a more elegant solution.
library(tidyr)
library(dplyr)
df %>%
  mutate(across(starts_with("Problem"), ~ .x == "Yes")) %>%
  pivot_longer(-Timestamp, names_to = c("name", "id"), names_pattern = "(.*)(\\d+)") %>%
  pivot_wider() %>%
  group_by(Timestamp) %>%
  mutate(Value = case_when(sum(Problem) == 0 | sum(Problem) == n() | !Problem ~ Value,
                           TRUE ~ sum(Value * (1 - Problem)) / sum(1 - Problem))) %>%
  pivot_longer(cols = c("Value", "Problem")) %>%
  mutate(name = paste0(name, id), .keep = "unused") %>%
  pivot_wider() %>%
  ungroup() %>%
  mutate(across(starts_with("Problem"), ~ ifelse(.x == 1, "Yes", "No")))
returns
# A tibble: 5 x 7
Timestamp Value1 Problem1 Value2 Problem2 Value3 Problem3
<time> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 00'00" 32 No 40 No 45 No
2 05'00" 12 No 55 No 48 No
3 10'00" 48.5 Yes 42 No 55 No
4 15'00" 33 Yes 33 No 33 Yes
5 20'00" 78 No 47 No 75 No
What approach did I use?
Transform the Problem variables into booleans; R can use booleans in calculations, and technically they are later coerced to doubles.
Turn the number at the end of each Value/Problem column into an id, so that for every timestamp there are several rows, one per measurement (see the sketch after this list).
Calculate the new value based on the number of problems and on whether the value itself is problematic.
Restore the original shape of the data.frame.
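For intuition, this is roughly what the data look like right after those first two pivots, before the new value is calculated (a sketch based on the pipeline above; Problem has already been coerced to 0/1 at this point):
df %>%
  mutate(across(starts_with("Problem"), ~ .x == "Yes")) %>%
  pivot_longer(-Timestamp, names_to = c("name", "id"), names_pattern = "(.*)(\\d+)") %>%
  pivot_wider()
# For the 00:10 timestamp this gives one row per measurement id:
#   Timestamp id    Value Problem
#   00:10     1        14       1   <- flagged, gets replaced by the mean of the others
#   00:10     2        42       0
#   00:10     3        55       0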
Data
df <- structure(list(Timestamp = structure(c(0, 300, 600, 900, 1200
), class = c("hms", "difftime"), units = "secs"), Value1 = c(32,
12, 14, 50, 78), Value2 = c(40, 55, 42, 33, 47), Value3 = c(45,
48, 55, 13, 75), Problem1 = c("No", "No", "Yes", "Yes", "No"),
Problem2 = c("No", "No", "No", "No", "No"), Problem3 = c("No",
"No", "No", "Yes", "No")), problems = structure(list(row = 5L,
col = "Problem3", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Timestamp = structure(list(format = ""), class = c("collector_time",
"collector")), Value1 = structure(list(), class = c("collector_double",
"collector")), Value2 = structure(list(), class = c("collector_double",
"collector")), Value3 = structure(list(), class = c("collector_double",
"collector")), Problem1 = structure(list(), class = c("collector_character",
"collector")), Problem2 = structure(list(), class = c("collector_character",
"collector")), Problem3 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
I use the data.table package and call the data from your second example din.
I hope this code helps you:
#I use the library data.table; get data in data.table format
library(data.table)
din <- data.table(din)
din[,Value1:=as.numeric(Value1)]
din[,Value2:=as.numeric(Value2)]
din[,Value3:=as.numeric(Value3)]
#set Values to NA if there is a Problem
din[Problem1=="Yes", Value1:=NA]
din[Problem2=="Yes", Value2:=NA]
din[Problem3=="Yes", Value3:=NA]
#print table with NA replaced if we have a Problem
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32 40 45 No No No
#2: 00:05 12 55 48 No No No
#3: 00:10 NA 42 55 Yes No No
#4: 00:15 NA 33 NA Yes No Yes
#5: 00:20 78 47 75 No No No
#use the mean function to replace if I have an NA in the table (just working if Timestamp is a unique id, otherwise you need to generate one and use this in the by argument)
din[is.na(Value1), Value1:=mean(c(Value2,Value3), na.rm=T), by=Timestamp]
din[is.na(Value2), Value2:=mean(c(Value1,Value3), na.rm=T), by=Timestamp]
din[is.na(Value3), Value3:=mean(c(Value1,Value2), na.rm=T), by=Timestamp]
#print final table
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32.0 40 45 No No No
#2: 00:05 12.0 55 48 No No No
#3: 00:10 48.5 42 55 Yes No No
#4: 00:15 33.0 33 33 Yes No Yes
#5: 00:20 78.0 47 75 No No No
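Since the number of Value/Problem pairs is not known in advance, the same idea can also be written without hard-coding the column names. This is only a sketch, starting again from the original din (i.e. before the replacements above); it assumes the ValueN/ProblemN naming pattern and leaves rows where every value is problematic unchanged:
value_cols   <- grep("^Value",   names(din), value = TRUE)
problem_cols <- sub("^Value", "Problem", value_cols)

vals  <- as.matrix(din[, ..value_cols])
probs <- as.matrix(din[, ..problem_cols]) == "Yes"

masked <- vals
masked[probs] <- NA                          # blank out the problematic values
row_means <- rowMeans(masked, na.rm = TRUE)  # row mean of the remaining values

# fill each problematic cell with its row mean; rows where everything is
# problematic have no mean (NaN) and are left as they are
fill <- matrix(row_means, nrow = nrow(vals), ncol = ncol(vals))
idx  <- probs & is.finite(fill)
vals[idx] <- fill[idx]

din[, (value_cols) := as.data.frame(vals)]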
I made a sample for the data:
df <- data.frame(value1 = runif(10, min = 0, max = 100),
                 value2 = runif(10, min = 0, max = 100),
                 value3 = runif(10, min = 0, max = 100))
df_problem <- data.frame(problem1 = sample(c('yes','no'), 10, replace = T),
                         problem2 = sample(c('yes','no'), 10, replace = T),
                         problem3 = sample(c('yes','no'), 10, replace = T))
See that I separated the values from the problems. Then:
df_problem[df_problem == 'no'] <- 1
df_problem[df_problem == 'yes'] <- NA
df_problem <- matrix(as.numeric(unlist(df_problem)), nrow = nrow(df)) #rebuild matrix
Finally:
df <- df * df_problem
for (i in 1:nrow(df)) {
  if (any(is.na(df[i, ]))) {
    df[i, which(is.na(df[i, ]))] <- mean(unlist(df[i, ]), na.rm = TRUE)
  }
}
df
I am trying to get the number of encounters a patient has, based on dates. This needs to be done in R with the tidyverse. An example dataset is here:
structure(list(person_id = c(1, 2, 2, 3, 3, 3), arrival = c("2020-01-01 08:00:00",
"2020-01-01 09:00:00", NA, "2020-01-01 10:00:00", NA, NA), completed = c("2020-01-01 9:00:00",
"2020-01-01 11:00:00", NA, "2020-01-01 11:00:00", NA, NA), admitted = c(NA,
NA, "2020-01-01 11:00:00", NA, "2020-01-01 11:00:00", "2020-01-09 11:00:00"
), discharged = c(NA, NA, NA, NA, "2020/01/02 12:00:00", "2020-01-13 12:00:00"
), encounter_number = c(1, 2, 3, 4, 5, 6)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
And the expected output, with the new column I want, should look like this:
structure(list(person_id = c(1, 2, 2, 3, 3, 3), arrival = c("2020-01-01 08:00:00",
"2020-01-01 09:00:00", NA, "2020-01-01 10:00:00", NA, NA), completed = c("2020-01-01 9:00:00",
"2020-01-01 11:00:00", NA, "2020-01-01 11:00:00", NA, NA), admitted = c(NA,
NA, "2020-01-01 11:00:00", NA, "2020-01-01 11:00:00", "2020-01-09 11:00:00"
), discharged = c(NA, NA, NA, NA, "2020/01/02 12:00:00", "2020-01-13 12:00:00"
), encounter_number = c(1, 2, 3, 4, 5, 6), person_total_encounter = c(1,
1, 2, 1, 2, 1)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Caveats!!
As you can see, the person with person_id 2 arrives at AE, is completed from AE, and is then admitted to hospital immediately, but has no discharge date yet. Still, the total encounter count adds up to 2, one for AE and one for inpatient, even without a discharge date. Moreover, the person with person_id 3 is admitted twice on different dates, which gives a total encounter count of up to 2, and then 1 for the last admission. Can someone help me with this?
Here is one thought. If this is not what you had in mind please let me know.
First, you can put your data into long format, with the event in one column (arrival, completed, admitted, discharged) and the date in a second column, and remove the NAs, which do not contribute to the results.
Then you can filter on events you wish to count. In this case, I selected completed and admitted.
Next you can group_by both person_id and date (just date without time). The person_total_encounter will be the row_number, which is just the running count or sequence of events for that date and person_id.
Edit: Added select at the beginning since the original dataset described by OP likely has additional columns.
library(tidyverse)
df1 %>%
  select(person_id, encounter_number, arrival, completed, admitted, discharged) %>%
  pivot_longer(cols = c(arrival, completed, admitted, discharged), names_to = "event", values_to = "date") %>%
  drop_na() %>%
  filter(event == "completed" | event == "admitted") %>%
  group_by(person_id, date = as.Date(date)) %>%
  mutate(person_total_encounter = row_number()) %>%
  ungroup() %>%
  select(-c(event, date)) %>%
  right_join(df1, by = c("person_id", "encounter_number"))
Output
# A tibble: 6 x 7
person_id encounter_number person_total_encounter arrival completed admitted discharged
<dbl> <dbl> <int> <chr> <chr> <chr> <chr>
1 1 1 1 2020-01-01 08:00:00 2020-01-01 09:00:00 NA NA
2 2 2 1 2020-01-01 09:00:00 2020-01-01 11:00:00 NA NA
3 2 3 2 NA NA 2020-01-01 11:00:00 NA
4 3 4 1 2020-01-01 10:00:00 2020-01-01 11:00:00 NA NA
5 3 5 2 NA NA 2020-01-01 11:00:00 2020-01-02 12:00:00
6 3 6 1 NA NA 2020-01-09 11:00:00 2020-01-13 12:00:00
I have a df with timesheet data and am looking for an easy way to summarize it. My data looks like df1 and I want to summarize it as df2. I am having a hard time devising a way to create the increments and allocate hours across them. The tricky part is allocating the hours that span across dates, ID 1 and 3, for example.
df1
ID Garage Unit_Name START_DATE_TIME END_DATE_TIME
<chr> <chr> <chr> <dttm> <dttm>
1 A Truck 1/26/2015 21:00 1/27/2015 7:00
2 B Truck 5/13/2015 6:00 5/13/2015 16:00
3 C Car 8/21/2015 21:00 8/22/2015 7:00
6 C Car 8/21/2015 11:00 8/21/2015 21:00
structure(list(ID = c("<chr>", "1", "2", "3", "6", NA, NA, NA,
NA, NA, NA), Garage = c("<chr>", "A", "B", "C", "C", NA, NA,
NA, NA, NA, NA), Unit_Name = c("<chr>", "Truck", "Truck", "Car",
"Car", NA, NA, NA, NA, NA, NA), START_DATE_TIME = c("<dttm>",
"1/26/2015 21:00", "5/13/2015 6:00", "8/21/2015 21:00", "8/21/2015 11:00",
NA, NA, NA, NA, NA, NA), END_DATE_TIME = c("<dttm>", "1/27/2015 7:00",
"5/13/2015 16:00", "8/22/2015 7:00", "8/21/2015 21:00", NA, NA,
NA, NA, NA, NA)), .Names = c("ID", "Garage", "Unit_Name", "START_DATE_TIME",
"END_DATE_TIME"), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"), spec = structure(list(cols = structure(list(
ID = structure(list(), class = c("collector_character", "collector"
)), Garage = structure(list(), class = c("collector_character",
"collector")), Unit_Name = structure(list(), class = c("collector_character",
"collector")), START_DATE_TIME = structure(list(), class = c("collector_character",
"collector")), END_DATE_TIME = structure(list(), class = c("collector_character",
"collector"))), .Names = c("ID", "Garage", "Unit_Name", "START_DATE_TIME",
"END_DATE_TIME")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
df2
Garage Unit_Name Date Increment Hours
<chr> <chr> <dttm> <chr> <dbl>
A Truck 1/26/2015 18:01-00:00 3
A Truck 1/27/2015 00:01-6:00 6
A Truck 1/27/2015 6:01-12:00 1
B Truck 5/13/2015 6:01-12:00 6
B Truck 5/13/2015 12:01-18:00 4
C Car 8/21/2015 6:01-12:00 1
C Car 8/21/2015 12:01-18:00 6
C Car 8/21/2015 18:01-00:00 6
C Car 8/22/2015 00:01-6:00 6
C Car 8/22/2015 6:01-12:00 1
library(tidyverse)
library(lubridate)
times  <- c("00:00", "06:00", "12:00", "18:00")
times1 <- c("00:01", "06:01", "12:01", "18:01")
df1 %>%
  group_by(Garage, Unit_Name) %>%
  summarise(START_DATE_TIME = min(START_DATE_TIME),
            END_DATE_TIME = max(END_DATE_TIME)) %>%
  mutate(S = mdy_hm(START_DATE_TIME),
         # index of the 6-hour block the shift starts in (1 = 00:00, ..., 4 = 18:00)
         b = floor(hour(S) / 6) + 1,
         m = ymd_hm(paste(format(S, "%F"), times[b])),
         # end of the first (possibly partial) block, rolling over to the next day if needed
         n = ymd_hm(paste(format(S, "%F"), times[(b + 1) %% 4 %>% replace(., . == 0, 4)])) %>%
           if_else(m > ., . + days(1), .),
         # hours remaining after the first block
         rem = as.numeric(mdy_hm(END_DATE_TIME) - n),
         # hours in the first block, then the full 6-hour blocks, then the final partial block
         HOURS = list(as.numeric(c(n - S, rep(6, rem %/% 6), rem %% 6)))) %>%
  unnest() %>%
  mutate(Date = S + hours(cumsum(lag(HOURS, default = 0))),
         b = floor(hour(Date) / 6) + 1,
         Increment = paste0(times1[b], "-",
                            times[replace(d <- (b + 1) %% 4, d == 0, 4)]),
         Date = as.Date(Date)) %>%
  select(Garage, Date, HOURS, Increment)
# Groups: Garage [3]
Garage Date HOURS Increment
<chr> <date> <dbl> <chr>
1 A 2015-01-26 3. 18:01-00:00
2 A 2015-01-27 6. 00:01-06:00
3 A 2015-01-27 1. 06:01-12:00
4 B 2015-05-13 6. 06:01-12:00
5 B 2015-05-13 4. 12:01-18:00
6 C 2015-08-21 1. 06:01-12:00
7 C 2015-08-21 6. 12:01-18:00
8 C 2015-08-21 6. 18:01-00:00
9 C 2015-08-22 6. 00:01-06:00
10 C 2015-08-22 1. 06:01-12:00
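If you also want Unit_Name and the column names from the desired df2, you can widen the final select a little (a small tweak to the pipeline above; Unit_Name is still a column after the summarise because it was a grouping variable):
select(Garage, Unit_Name, Date, Increment, Hours = HOURS)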
I have a dataframe (example below) which has one time variable and two other variables.
data<- data.frame(structure(list(datetime = c("7/17/2017 8:16:53", "7/17/2017 8:16:55",
"7/17/2017 8:16:57", "7/17/2017 8:16:59", "7/17/2017 8:17:01",
"7/17/2017 8:17:02", "7/17/2017 8:17:04", "7/17/2017 8:17:06",
"7/17/2017 8:17:08", "7/17/2017 8:17:10", "7/17/2017 8:17:12",
"7/17/2017 8:17:13", "7/17/2017 8:17:15", "7/17/2017 8:17:17",
"7/17/2017 8:17:19", "7/17/2017 8:17:21", "7/17/2017 8:17:22",
"7/17/2017 8:17:27", "7/17/2017 8:17:29", NA, NA), var1 = c(252.234873,
254.0436836, 252.5279108, 252.4802478, 252.6377229, 253.8766496,
249.8086397, 249.5646219, 249.1815691, 253.9509387, 251.7245156,
251.8415925, 254.2059507, 253.9145112, 251.8415925, 254.2059507,
253.9145112, 252.4802478, 252.6377229, NA, NA), var2 = c(582.5766695,
583.0972735, 582.7872586, 582.312636, 579.6445667, 579.7995196,
578.9574528, 576.5341483, 575.8460797, 574.2353493, 574.8998519,
574.1717159, 573.8133058, 574.6849578, 574.1717159, 573.8133058,
574.6849578, 582.312636, 579.6445667, NA, NA)), .Names = c("datetime",
"var1", "var2"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-21L), spec = structure(list(cols = structure(list(datetime = structure(list(), class = c("collector_character",
"collector")), var1 = structure(list(), class = c("collector_double",
"collector")), var2 = structure(list(), class = c("collector_double",
"collector"))), .Names = c("datetime", "var1", "var2")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec")))
I would like to filter my data into different periods based on the time variable. I have listed the periods as From and To in the example below.
tab_filt <- data.frame(structure(list(From = c("7/17/2017 8:16:53", "7/17/2017 8:17:04",
"7/17/2017 8:17:19"), To = c("7/17/2017 8:16:59", "7/17/2017 8:17:10",
"7/17/2017 8:17:27")), .Names = c("From", "To"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L), spec = structure(list(
cols = structure(list(From = structure(list(), class = c("collector_character",
"collector")), To = structure(list(), class = c("collector_character",
"collector"))), .Names = c("From", "To")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec")))
To make helping easier, I have also converted the times to POSIXct for the example data.
data$datetime <- as.POSIXct(strptime(data$datetime, format="%m/%d/%Y %H:%M:%S"))
tab_filt$From <- as.POSIXct(strptime(tab_filt$From, format="%m/%d/%Y %H:%M:%S"))
tab_filt$To <- as.POSIXct(strptime(tab_filt$To, format="%m/%d/%Y %H:%M:%S"))
I would like to know how I can filter my data to keep only the periods in the second table.
Please help.
Let me know if you need any additional details :)
Here is a neat way using the package lubridate:
library(lubridate)
library(dplyr)
# create intervals using %--%
ints <- tab_filt$From %--% tab_filt$To
# check for each row if datetime lies in any of the intervals using %within%
data %>%
  rowwise() %>%
  mutate(In = any(datetime %within% ints))
This results in
# A tibble: 21 x 4
datetime var1 var2 In
<dttm> <dbl> <dbl> <lgl>
1 2017-07-17 08:16:53 252. 583. TRUE
2 2017-07-17 08:16:55 254. 583. TRUE
3 2017-07-17 08:16:57 253. 583. TRUE
4 2017-07-17 08:16:59 252. 582. TRUE
5 2017-07-17 08:17:01 253. 580. FALSE
6 2017-07-17 08:17:02 254. 580. FALSE
7 2017-07-17 08:17:04 250. 579. TRUE
8 2017-07-17 08:17:06 250. 577. TRUE
9 2017-07-17 08:17:08 249. 576. TRUE
10 2017-07-17 08:17:10 254. 574. TRUE
# ... with 11 more rows
where In = FALSE indicates that these rows should be dropped. To do so, just add %>% filter(In) to the pipe above.
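Spelled out, keeping only the rows that fall inside one of the intervals could look like this (a small sketch continuing the pipe above; the rows with NA datetime drop out as well):
data %>%
  rowwise() %>%
  mutate(In = any(datetime %within% ints)) %>%
  ungroup() %>%
  filter(In) %>%
  select(-In)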
First of all I must thank the OP for adding sample data and the commands to convert the date fields.
One can use data.table to join data with tab_filt and keep the rows whose datetime falls within the From/To ranges defined in tab_filt:
library(data.table)
setDT(data)
setDT(tab_filt)
data[tab_filt, .(x.datetime,x.var1,x.var2), on=.(datetime <= To, datetime >= From)]
# x.datetime x.var1 x.var2
# 1: 2017-07-17 08:16:53 252.2349 582.5767
# 2: 2017-07-17 08:16:55 254.0437 583.0973
# 3: 2017-07-17 08:16:57 252.5279 582.7873
# 4: 2017-07-17 08:16:59 252.4802 582.3126
# 5: 2017-07-17 08:17:04 249.8086 578.9575
# 6: 2017-07-17 08:17:06 249.5646 576.5341
# 7: 2017-07-17 08:17:08 249.1816 575.8461
# 8: 2017-07-17 08:17:10 253.9509 574.2353
# 9: 2017-07-17 08:17:19 251.8416 574.1717
# 10: 2017-07-17 08:17:21 254.2060 573.8133
# 11: 2017-07-17 08:17:22 253.9145 574.6850
# 12: 2017-07-17 08:17:27 252.4802 582.3126
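If you prefer the original column names in the result, you can name them directly in j (a small variation on the join above):
data[tab_filt, .(datetime = x.datetime, var1 = x.var1, var2 = x.var2),
     on = .(datetime <= To, datetime >= From)]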