Finding all ids available at a particular time - r

I have a binary matrix that gives the indication whether a person (ID) is available at a time to do a job. The example matrix is
08:00 08:30 09:00 09:30 10:00 10:30 11:00 11:30 12:00 12:30 13:00 13:30 14:00 14:30 15:00 15:30 16:00 16:30 17:00 17:30 18:00 18:30 19:00
1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0
2 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0
3 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
4 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
5 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
6 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
19:30
1 0
2 0
3 0
4 0
5 0
6 0
The row names are the IDs, and a 1 marks a half-hour slot in which that ID is available. In the example, IDs 1 and 2 start work at 8:00 and take breaks at 10:30-11:00 and 13:00-13:30. IDs 3 and 4, who start half an hour later, take breaks at 11:00-11:30 and 13:30-14:00. The staggering ensures that somebody is available for a job that can start at any time.
dput(matrix)
structure(c(1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), .Dim = c(6L, 24L), .Dimnames = list(c("1", "2", "3", "4",
"5", "6"), c("08:00", "08:30", "09:00", "09:30", "10:00", "10:30",
"11:00", "11:30", "12:00", "12:30", "13:00", "13:30", "14:00",
"14:30", "15:00", "15:30", "16:00", "16:30", "17:00", "17:30",
"18:00", "18:30", "19:00", "19:30")))
Another dataset has the IDs with their starting times
data1 <- data.frame(ID = 1:6, Start_Time = c("8:00", "8:00", "8:30",
"8:30", "9:00", "9:30"), stringsAsFactors=FALSE)
A third dataset will have the start and end timings for a particular task
data2 <- data.frame(Start = c("8:01", "9:35", "10:42", "11:25", "14:22",
"17:20", "18:19"), End = c("8:22", "9:42", "11:20", "11:32",
"14:35", "18:15", "18:25"), stringsAsFactors=FALSE)
I am trying to create a column in data2 that gives the IDs available to do the task based on the Start time in data2. The expected output is
data2$IdsAvail <- c("1, 2", "1, 2, 3, 4, 5, 6", "3, 4, 5, 6",
"1, 2, 5, 6", "1, 2, 3, 4", "3, 4, 5, 6", NA)
It would look like below
data2
Start End IdsAvail
1 8:01 8:22 1, 2
2 9:35 9:42 1, 2, 3, 4, 5, 6
3 10:42 11:20 3, 4, 5, 6
4 11:25 11:32 1, 2, 5, 6
5 14:22 14:35 1, 2, 3, 4
6 17:20 18:15 3, 4, 5, 6
7 18:19 18:25 <NA>
I tried to match the IDs with the times in the matrix but couldn't find a way. It is also possible that two jobs arrive within the time frame in which one person is already doing a job; I am not taking that into consideration here. This is just to get the initial IDs available based on the matrix.
EDIT: The solution below by @Audiophile works for the example, but on my data the following step throws a warning because of duplicates
availability <- merge(availability,data2,by.x = 'time',by.y = 'slot',all.y = T)
I had to use allow.cartesian to make it work on the original dataset. My dataset has about 2,000 rows; after the merge it has more than 20,000 rows. With this example, the merge step above also returns a different number of rows than either 'availability' or 'data2'. Is there another method, e.g. using foverlaps from data.table?

Identify the slots for which each person is available, and then merge it with the task list:
library(tidyr)
library(dplyr)
#Convert your availability matrix (mat1) to a data frame
df <- as.data.frame(mat1)
df$ID <- rownames(df)
#Reshape the availability dataset to one row per ID and available slot
availability <- df %>%
  gather(time,available,-ID) %>%
  filter(available==1) %>%
  mutate(time = as.POSIXct(time,format = "%H:%M"))
data1$Start_Time <- as.POSIXct(data1$Start_Time,format = "%H:%M")
data2$Start <- as.POSIXct(data2$Start,format = "%H:%M")
#Use start times to refine availability dataset
availability <- merge(availability,data1,by = "ID")
availability <- availability %>%
  filter(time>=Start_Time) %>%
  select(ID,time)
#Round task start time down to the half hour slot that contains it
data2$slot <- as.POSIXct(floor(as.double(data2$Start)/1800)*1800,
                         format = "%H:%M",origin = as.POSIXct('1970-01-01',tz='UTC'))
#Keep every task (all.y = TRUE), even if nobody is available in its slot
availability <- merge(availability,data2,by.x = 'time',by.y = 'slot',all.y = TRUE)
availability <- availability %>%
  select(Start,End,ID) %>%
  arrange(Start,ID) %>%
  group_by(Start,End) %>%
  summarise(IdsAvail = toString(ID)) %>%
  ungroup() %>%
  mutate(Start = format(Start,"%H:%M"))
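If the merge grows too large, one possible alternative is to skip the join and look up each task's slot directly as a column of the availability matrix. The following is only a sketch under assumptions: `slot_of()` and `ids_available()` are hypothetical helper names, `toy_mat` is a small made-up stand-in for the real matrix, and every task start is assumed to fall into a slot that exists as a column name. It returns one value per task, so the number of rows never changes.

```r
# Hypothetical 3 x 4 availability matrix standing in for mat1 (illustration only).
toy_mat <- matrix(
  c(1, 0, 0,   # 08:00 -> only ID 1
    1, 1, 0,   # 08:30 -> IDs 1 and 2
    0, 1, 1,   # 09:00 -> IDs 2 and 3
    0, 0, 0),  # 09:30 -> nobody
  nrow = 3,
  dimnames = list(c("1", "2", "3"), c("08:00", "08:30", "09:00", "09:30"))
)

# Floor "H:MM" strings to the half-hour slot label used as column names.
slot_of <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    mins <- as.integer(p[1]) * 60L + as.integer(p[2])
    mins <- (mins %/% 30L) * 30L
    sprintf("%02d:%02d", mins %/% 60L, mins %% 60L)
  }, character(1))
}

# For each start time, collapse the IDs that have a 1 in the matching column.
ids_available <- function(mat, start) {
  vapply(slot_of(start), function(s) {
    ids <- rownames(mat)[mat[, s] == 1]
    if (length(ids) > 0) toString(ids) else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

toy_starts <- c("8:01", "8:45", "9:10", "9:59")
ids_available(toy_mat, toy_starts)
```

With the real data this would be called as `data2$IdsAvail <- ids_available(mat1, data2$Start)`, before `data2$Start` is converted to POSIXct.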

Related

rolling computation to fill gaps by finding following or previous values in a data.table time series

I have a data.table that looks like this:
library(data.table)
tsdata <- data.table(time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
signal = c(0, 1, 1, 0, 0, 1, 0, 0, 0, 1))
I am trying to fill the gaps between the ones, but only if the gap of zeros is small. So a flexible solution to define the gap would be nice. In this example the gap with zeros shouldn't be bigger than 2.
The result should look like this:
tsdata <- data.table(time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
signal = c(0, 1, 1, 1, 1, 1, 0, 0, 0, 1))
My real time series data is much bigger than this, so any help is appreciated.
Group by rleid(signal) and then fill in short 0 sequences not at the beginning or end with 1.
tsdata[, signal2 := ifelse(signal[1] == 0 &
.N <= 2 &
time[1] > min(tsdata$time) &
time[.N] < max(tsdata$time), 1, signal),
by = rleid(signal)]
tsdata
giving:
time signal signal2
1: 1 0 0
2: 2 1 1
3: 3 1 1
4: 4 0 1
5: 5 0 1
6: 6 1 1
7: 7 0 0
8: 8 0 0
9: 9 0 0
10: 10 1 1
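The same idea can be wrapped in a function so that the maximum gap is a parameter rather than a hard-coded 2. This is a minimal base R sketch, not part of the answer above: `fill_short_gaps()` is a hypothetical name, and it assumes the rows are already ordered by time and evenly spaced, so that a gap can be measured in rows.

```r
# Fill runs of zeros that are at most `max_gap` long and that sit strictly
# between two runs of ones (leading and trailing zero runs are left alone).
fill_short_gaps <- function(signal, max_gap = 2) {
  runs <- rle(signal)
  n <- length(runs$values)
  interior <- seq_len(n) > 1 & seq_len(n) < n
  fill <- runs$values == 0 & runs$lengths <= max_gap & interior
  runs$values[fill] <- 1
  inverse.rle(runs)
}

signal <- c(0, 1, 1, 0, 0, 1, 0, 0, 0, 1)
filled <- fill_short_gaps(signal, max_gap = 2)
filled
```

Inside a data.table this could be used as `tsdata[, signal2 := fill_short_gaps(signal, max_gap = 2)]`.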

Create column conditioning the behavior of rows in the dataset

I would like to do something very specific. I have a vast set of data, which, in summary, looks more or less like this, with values 0, 1 and 2:
I need to create a situation variable so that it contains the value 0, 1 and 2.
The value 0 for cases that contain only 0's and 1's in the entire line.
The value 1 for the case where the value 2 appears, but at some point 1 appears before it.
The value 2 for the case where the value 2 appears, but at some point 0 appears before it.
So it's something close to:
structure(list(X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1), X2 = c(1,
1, 1, 1, 0, 0, 0, 0, 0, 2), X3 = c(0, 1, 1, 1, 1, 0, 0, 1, 0,
0), X4 = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0), X5 = c(2, 1, 1, 0,
2, 1, 1, 0, 0, 0), X6 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0), X7 = c(2,
1, 1, 1, 2, 1, 1, 2, 0, 0), X8 = c(0, 1, 1, 1, 2, 1, 2, 2, 2,
0)), class = "data.frame", row.names = c(NA, 10L))
I wrote a score function and applied it over all the rows of your dataframe.
score <- function(x) {
a <- which(x == 2)
ifelse(length(a) > 0, ifelse(a[1] >=2, 2 - x[a[1] - 1], 1), 0)
}
df <- structure(list(X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1),
X2 = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 2),
X3 = c(0, 1, 1, 1, 1, 0, 0, 1, 0, 0),
X4 = c(0, 1, 1, 0, 1, 1, 0, 0, 0, 0),
X5 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0),
X6 = c(2, 1, 1, 0, 2, 1, 1, 0, 0, 0),
X7 = c(2, 1, 1, 1, 2, 1, 1, 2, 0, 0),
X8 = c(0, 1, 1, 1, 2, 1, 2, 2, 2, 0)),
class = "data.frame", row.names = c(NA, 10L))
df$situation <- sapply(1:nrow(df), function(i) score(as.numeric(df[i,])))
df
Here's a tidyverse approach.
I'll first concatenate all columns together, then use grepl() to look for 12 or 02.
library(tidyverse)
df %>% rowwise() %>%
mutate(concat = paste(c_across(everything()), collapse = "")) %>%
ungroup() %>%
mutate(situation = case_when(
!grepl(2, concat) ~ 0,
grepl("12", concat) ~ 1,
grepl("02", concat) ~ 2
)) %>%
select(-concat)
Output
# A tibble: 10 x 9
X1 X2 X3 X4 X5 X6 X7 X8 situation
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0 2 2 2 0 2
2 1 1 1 1 1 1 1 1 0
3 1 1 1 1 1 1 1 1 0
4 1 1 1 0 0 0 1 1 0
5 1 0 1 1 2 2 2 2 1
6 1 0 0 1 1 1 1 1 0
7 1 0 0 0 1 1 1 2 1
8 1 0 1 0 0 0 2 2 2
9 0 0 0 0 0 0 0 2 2
10 1 2 0 0 0 0 0 0 1
Note that this solution assumes that:
2 will not appear in the first column
1 or 2 in the situation is defined by the number immediately before 2 in your dataset
There will not be a case of 12 and 02 happening in the same row
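Before relying on these assumptions, it may be worth checking whether the data actually satisfies them. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the data from the question) and reports rows that start with a 2 as well as rows containing both "12" and "02".

```r
# Hypothetical data: row 1 contains both "12" and "02", row 2 only "02".
toy <- data.frame(X1 = c(1, 0, 1),
                  X2 = c(2, 2, 0),
                  X3 = c(0, 0, 1),
                  X4 = c(2, 0, 1))

# Concatenate each row into one string, e.g. "1202".
rows <- unname(apply(toy, 1, paste, collapse = ""))

has_12 <- grepl("12", rows, fixed = TRUE)
has_02 <- grepl("02", rows, fixed = TRUE)

# Rows violating the assumptions listed above.
starts_with_2 <- which(toy[[1]] == 2)
ambiguous <- which(has_12 & has_02)

starts_with_2
ambiguous
```

If either result is non-empty, the order of the conditions in case_when() decides the outcome for those rows, so the rule would need to be made explicit.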

Cumulative count for a column using R

I got data like this
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1), drug_2 = c(0, 1, 1, 1, 1, 0,
1, 0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, -12L
))
I would like to get the cumulative count of each column for each id and get the data like this
structure(list(id2 = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1_b = c(0,
0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2), drug_2_b = c(0, 1, 2, 3, 4,
0, 5, 0, 0, 1, 0, 2)), class = "data.frame", row.names = c(NA,
-12L))
You can get a cumulative sum with cumsum.
To split data.frame into subsets, you can use split and then lapply cumsum over the list of the data.frames and again over the list of the columns, or you can use the ave function which does exactly that:
data = structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1), drug_2 = c(0, 1, 1, 1, 1, 0,
1, 0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, -12L
))
data[-1] = ave(data[-1], data$id, FUN=cumsum)
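For comparison, the split/lapply route mentioned above can be written out by hand for a single column. This is a small sketch on made-up data (`toy` is hypothetical); it is only meant to show that ave performs the same split, apply and reassemble steps.

```r
toy <- data.frame(id = c(1, 1, 1, 2, 2),
                  drug_1 = c(0, 1, 1, 1, 1))

# Split the column by id, take the cumulative sum within each piece,
# then put the pieces back in the original row order.
pieces <- split(toy$drug_1, toy$id)
by_hand <- unsplit(lapply(pieces, cumsum), toy$id)

# ave() does the same in one call.
with_ave <- ave(toy$drug_1, toy$id, FUN = cumsum)

by_hand
```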
edit:
I assumed that a cumulative sum was requested (as per the instructions) and that there is a mistake in the example data. If the example data is correct, then the rule is: where the count is zero, leave it at zero instead of taking the cumulative sum, i.e. ifelse(x == 0, 0, cumsum(x)) (as per @r2evans). However, this construct doesn't work when applied to a data.frame, so a slightly more complex helper function is required:
data[-1] = ave(data[-1], data$id, FUN=function(x){
y = cumsum(x)
y[x == 0] = 0
y
})
We can now compare it with the requested (renamed) data:
result = structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2), drug_2 = c(0, 1, 2, 3, 4,
0, 5, 0, 0, 1, 0, 2)), class = "data.frame", row.names = c(NA,
-12L))
identical(data, result)
Base R,
ave(df$drug_2, df$id, FUN = function(z) ifelse(z == 0, z, cumsum(z)))
# [1] 0 1 2 3 4 0 5 0 0 1 0 2
Edit: Simplified the solution after reading r2evans' approach.
You could use
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with("drug"),
~ifelse(.x == 0, 0, cumsum(.x)))) %>%
ungroup()
This returns
# A tibble: 12 x 3
id drug_1 drug_2
<dbl> <dbl> <dbl>
1 1 0 0
2 1 0 1
3 1 0 2
4 1 0 3
5 1 0 4
6 1 1 0
7 1 2 5
8 2 0 0
9 2 0 0
10 2 1 1
11 2 0 0
12 2 2 2
Base R solution:
# Resolve the names of vectors we want to cumulatively sum:
# drug_vec_names => character vector
drug_vec_names <- grep( "^drug\\_", colnames(df), value = TRUE)
# Resolve the names of vectors we want to keep:
# not_drug_vec_names => character vector
not_drug_vec_names <- names(df)[!(names(df) %in% drug_vec_names)]
# Calculate the result: res => data.frame
res <- setNames(
cbind(
df[,not_drug_vec_names],
replace(
ave(
df[,drug_vec_names],
df[,not_drug_vec_names],
FUN = cumsum
),
df[,drug_vec_names] == 0,
0
)
),
c(not_drug_vec_names, drug_vec_names)
)
If you have binary values (1/0) in the drug columns, you can multiply the cumulative sum by the original column to get 0 wherever the original value is 0.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with('drug'), ~cumsum(.) * .)) %>%
ungroup
# id drug_1 drug_2
# <dbl> <dbl> <dbl>
# 1 1 0 0
# 2 1 0 1
# 3 1 0 2
# 4 1 0 3
# 5 1 0 4
# 6 1 1 0
# 7 1 2 5
# 8 2 0 0
# 9 2 0 0
#10 2 1 1
#11 2 0 0
#12 2 2 2
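The multiplication trick depends on the columns being strictly 0/1. A short sketch on made-up vectors (`x` and `counts` are hypothetical) shows where it agrees with the ifelse version and where it does not.

```r
# Binary input: multiplying by the original column zeroes out the 0 positions.
x <- c(0, 1, 1, 0, 1)
by_product <- cumsum(x) * x
by_ifelse  <- ifelse(x == 0, 0, cumsum(x))

# Non-binary input: the product scales the running total instead of masking it.
counts <- c(0, 2, 1)
product_counts <- cumsum(counts) * counts
ifelse_counts  <- ifelse(counts == 0, 0, cumsum(counts))

by_product
product_counts
ifelse_counts
```

For columns that can hold counts larger than 1, the ifelse form is the safer choice.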

How can I select the relevant data by group in the data frame if the 1st row by group satisfy specific condition in R?

The data set has 3 columns: the 1st column is "id", the 2nd is "treatment", and the 3rd is "time". The 2nd column is a binary variable. I want to extract data by group according to the following rule.
Within each id, if the first row satisfies the condition (time = 1 and treatment = 0), select all rows of that id.
In short, the expected data set should look like this:
id treatment time
1 0 1
1 0 2
1 0 3
1 0 4
1 0 5
1 0 6
1 0 7
1 NA 8
1 0 9
1 0 10
3 0 1
3 NA 2
3 1 3
3 1 4
3 1 5
3 1 6
3 1 7
3 NA 8
3 1 9
3 1 10
5 0 1
5 NA 2
5 0 3
5 0 4
5 0 5
5 0 6
5 0 7
5 0 8
5 0 9
5 0 10
The original data set with errors is structured as follows,
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), treatment = c(0, 0, 0, 0, 0, 0, 0, NA, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, NA, 1, 1, 1, 1, 1, NA, 1, 1,
NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0),
time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), row.names = c(NA,
50L), class = "data.frame")->dataframe
Thank you!
You can filter on the first value in each group.
library(dplyr)
dataframe %>%
  group_by(id) %>%
  filter(first(time) == 1 && first(treatment) == 0) %>%
  ungroup
# id treatment time
# <dbl> <dbl> <dbl>
# 1 1 0 1
# 2 1 0 2
# 3 1 0 3
# 4 1 0 4
# 5 1 0 5
# 6 1 0 6
# 7 1 0 7
# 8 1 NA 8
# 9 1 0 9
#10 1 0 10
# … with 20 more rows
Another approach in base R would be to look only at the first row of each id and keep the ids whose first row has treatment = 0 and time = 1.
subset(dataframe, id %in% id[!duplicated(id) & treatment == 0 & time == 1])
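Groups whose first treatment value is NA are worth a second look, because the comparison then yields NA rather than FALSE. The sketch below uses a small made-up data frame (`toy` is hypothetical) and wraps the condition in which(), which drops NA results explicitly, so such groups are excluded on purpose rather than by accident.

```r
toy <- data.frame(id        = c(1, 1, 2, 2, 3, 3),
                  treatment = c(0, NA, 1, 1, NA, 0),
                  time      = c(1, 2, 1, 2, 1, 2))

# First row of every id.
first_rows <- toy[!duplicated(toy$id), ]

# which() keeps only rows where the condition is TRUE (NA is dropped).
keep_ids <- first_rows$id[which(first_rows$time == 1 & first_rows$treatment == 0)]

selected <- toy[toy$id %in% keep_ids, ]
selected
```

Here only id 1 is kept: id 2 starts with treatment 1, and id 3 starts with a missing treatment.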

How to create multiple data frame from one data frame with multiple condition in R

I would like to create four data sets from the following given data frame by multiple conditions in x1 and x2
mydata=structure(list(y = c(-3, 24, 4, 5, 3, -3, -3, 24, 5, 4, 8, 7,
9, 2, 4, 8, 7, 3, 8, 12, 9, 10, 12, 11, 2),
x1 = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
0, 1, 0, 1, 1, 0, 0, 1, 1, 1
),
x2 = c(1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0)), class = "data.frame",
row.names = c(NA,25L))
The first data set is mydata00 which is constructed with these conditions x1=0 and x2=0,
mydata00=filter(mydata, c(mydata$x1==0 & mydata$x2==0))
> mydata00
y x1 x2
1 -3 0 0
2 -3 0 0
3 8 0 0
4 3 0 0
5 9 0 0
Now, I need only the unique values of y and corresponding x1 and x2. Finally, I would like to sort y. So my final data set must look like
y x1 x2
1 -3 0 0
2 3 0 0
3 8 0 0
4 9 0 0
I would like to do the same for mydata11, mydata10 and mydata01, where
mydata11=filter(mydata, c(mydata$x1==1 & mydata$x2==1))
mydata10=filter(mydata, c(mydata$x1==1 & mydata$x2==0))
mydata01=filter(mydata, c(mydata$x1==0 & mydata$x2==1))
Can I use a for loop or a built-in function to create these data sets?
Any help is appreciated.
We can split the data based on unique values of x1 and x2 and get unique rows in each list after ordering it by y.
temp <- lapply(split(mydata, list(mydata$x1, mydata$x2)), function(x)
unique(x[order(x$y), ]))
temp
#$`0.0`
# y x1 x2
#6 -3 0 0
#18 3 0 0
#16 8 0 0
#21 9 0 0
#$`1.0`
# y x1 x2
#14 2 1 0
#5 3 1 0
#10 4 1 0
#4 5 1 0
#...
If we need data as a separate dataframe, we can name them appropriately and use list2env.
names(temp) <- paste0("mydata", names(temp))
list2env(temp, .GlobalEnv)
A tidyverse way of doing this would be:
library(tidyverse)
mydata %>% group_split(x1, x2) %>% map(~.x %>% arrange(y) %>% distinct)
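Note that split() names the pieces "0.0", "1.0" and so on, so pasting "mydata" in front gives names such as mydata0.0 rather than mydata00. If the names from the question are wanted, the dots can be removed first. The following is a small sketch on made-up data (`toy` and `parts` are hypothetical names):

```r
toy <- data.frame(y  = c(5, -3, -3, 8, 2, 2),
                  x1 = c(0, 0, 0, 1, 1, 1),
                  x2 = c(0, 0, 0, 1, 1, 1))

# Split by the x1/x2 combination, sort each piece by y, drop duplicate rows.
# drop = TRUE omits combinations that do not occur in the data.
parts <- lapply(split(toy, list(toy$x1, toy$x2), drop = TRUE),
                function(d) unique(d[order(d$y), ]))

# Turn "0.0" into "mydata00", "1.1" into "mydata11", and so on.
names(parts) <- paste0("mydata", gsub(".", "", names(parts), fixed = TRUE))

names(parts)
parts$mydata00
```

Keeping the pieces in one named list is often easier to work with than separate objects; list2env(parts, .GlobalEnv) can still be applied afterwards if individual data frames are needed.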
