Find row where multiple criteria are met, return a different column, for a full dataset - r

I have two datasets; 'data' and 'noiseaware'. noiseaware contains a RoomCode and a time stamp.
RoomCode last_trigger
GTX-513 2020-05-09 00:30:28
data contains a ton of things, including a reservation code, a check-in time stamp, a check-out time stamp, and a RoomCode. Ie
ReservationID RoomCode checkin_time checkOutDate
25307070gawgw GTX-513 2020-04-09 00:30:28 2020-05-09 00:30:28
My objective is that for each line in noiseaware, I want to find the corresponding reservation ID that matches the following combination:
Is after the checkInDate
Is before the checkOutDate
Has the same RoomCode
That in logic is as follows:
noiseaware$last_trigger <= data$checkOutDate & noiseaware$last_trigger >= data$checkInDate & data$RoomCode == noiseaware$RoomCode
However, I can't work out how to turn that logic - which returns a vector of true and false values - into something that returns the ReservationId. If it makes any difference, there should only be one matching ID for the above criteria.
Once I can do that, I'd then want to loop through and do the same for each line in noiseaware. I suppose I could do that with lapply?

Sounds like something dplyr can handle easily.
You will need to left_join noiseaware to data by RoomCode, and then filter out the rows you don't need.
Here's an example. Without sample data I have no way to test this, so you may need to tweak the code to fit the actual data, but the basic idea is there.
library("dplyr")
noiseaware %>%
left_join(data, by = "RoomCode") %>%
filter(last_trigger > checkin_time & last_trigger < checkOutDate)
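To check the join-then-filter idea on something concrete, here is a minimal sketch in base R (merge() instead of left_join(), so no packages are needed). The column names follow the sample header in the question; the second reservation and both trigger times are made up for illustration.

```r
# Toy data: values other than the first reservation ID and RoomCode are invented.
noiseaware <- data.frame(
  RoomCode = c("GTX-513", "GTX-513"),
  last_trigger = as.POSIXct(c("2020-05-01 00:30:28", "2020-06-15 12:00:00"), tz = "UTC")
)
data <- data.frame(
  ReservationID = c("25307070gawgw", "hypothetical02"),
  RoomCode = c("GTX-513", "GTX-513"),
  checkin_time = as.POSIXct(c("2020-04-09 00:30:28", "2020-06-10 15:00:00"), tz = "UTC"),
  checkOutDate = as.POSIXct(c("2020-05-09 00:30:28", "2020-06-20 10:00:00"), tz = "UTC")
)

# Pair every trigger with every reservation for the same room ...
joined <- merge(noiseaware, data, by = "RoomCode")

# ... then keep only the pairs where the trigger falls inside the stay.
inside_stay <- joined$last_trigger >= joined$checkin_time &
  joined$last_trigger <= joined$checkOutDate
matched <- joined[inside_stay, c("RoomCode", "last_trigger", "ReservationID")]
matched <- matched[order(matched$last_trigger), ]
print(matched)
```

Note that the intermediate join holds one row per trigger/reservation pair for a room, so it can get large when a room has many reservations.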

An option using data.table:
library(data.table)
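# Non-equi join: for each row of noiseaware, look up the reservation in the same room
# whose stay covers last_trigger, and store its ID in a new ReservationID column.
# Use the actual check-in column name here (the sample header shows checkin_time).
setDT(noiseaware)[, ReservationID :=
    setDT(data)[.SD, on=.(RoomCode, checkInDate<=last_trigger, checkOutDate>=last_trigger),
        mult="last", x.ReservationID]
]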
mult="last" uses the last observation if there are multiple results for a row in noiseaware.

Related

Conditional sub-setting or nulling

I have to include participants in a data frame (new or existing) if they have a higher score in the invalid condition than in the valid condition. But I have data from two time points (T1 and T3).
I have tried this one: data_new <- subset(data_raw, T1_invalid > T1_valid & T3_invalid > T3_valid)
However, it did not work because, for instance, some participants may have higher invalid score in just one time (T1), not in the second time (T3), or vice versa.
For example, a person can have higher invalid in one of the times, let's say T1_invalid > T1_valid. This should be included to the new data frame, it is okay. But, T3_invalid - T3_valid should be excluded because the invalid score is not higher than the valid score. But when you use AND operator, it excludes the person because, they have to have higher invalid scores in both T1 and T3. So, we over exclude in that case.
When you use OR operator it is the same. For example, a person has a higher score in T1_invalid > T1_valid, but not in the T3_invalid - T3_valid. Then, since one of the conditions is okay, it includes the person, but this person failed at T3. So, we should exclude T3_invalid - valid scores.
So basically, I was looking for something can check them separately. Then, I decided to make it null one by one like this:
data_raw[data_raw$T1_invalid < data_raw$T1_valid, c("T1_invalid", "T1_valid")] <- NA
data_raw[data_raw$T3_invalid < data_raw$T3_valid, c("T3_invalid", "T3_valid")] <- NA
However, it did not let me do this, I think because I use the same variables twice: once in the condition part (<) and once when setting them to NA.
Does anyone have any idea? By the way they have to be in the same data frame for using in the model.
Here is a plain data.table solution you can try.
library(data.table)
setDT(data_raw)
# Blank out T1 scores where invalid < valid. T1_invalid is overwritten first, so the
# second test sees NA for those rows and ifelse() returns NA for T1_valid as well.
data_raw[, T1_invalid := ifelse(T1_invalid < T1_valid,NA,T1_invalid)]
data_raw[, T1_valid := ifelse(T1_invalid < T1_valid,NA,T1_valid)]
# Same for T3.
data_raw[, T3_invalid := ifelse(T3_invalid < T3_valid,NA,T3_invalid)]
data_raw[, T3_valid := ifelse(T3_invalid < T3_valid,NA,T3_valid)]
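For comparison, the asker's original base R idea can probably be made to work by computing each test once and wrapping it in which(). This is only a sketch on made-up scores; it assumes the original assignment failed because the comparison produced NA for some rows, which data-frame assignment does not accept as a logical subscript.

```r
# Toy scores: column names follow the question, values are invented.
data_raw <- data.frame(
  T1_invalid = c(5, 2, 7),
  T1_valid   = c(3, 4, 6),
  T3_invalid = c(1, 9, 8),
  T3_valid   = c(2, 5, 3)
)

# Compute each test once, before anything is overwritten.
# which() returns row numbers and silently drops rows where the comparison is NA.
t1_fail <- which(data_raw$T1_invalid < data_raw$T1_valid)
t3_fail <- which(data_raw$T3_invalid < data_raw$T3_valid)

# Blank out each time point separately; the other time point is left untouched.
data_raw[t1_fail, c("T1_invalid", "T1_valid")] <- NA
data_raw[t3_fail, c("T3_invalid", "T3_valid")] <- NA
print(data_raw)
```

Each participant stays in the data frame, and only the time point that failed the check is set to NA.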

How to check for the presence of multiple strings for each value of a particular column in R Dataframe?

How do we identify all those row entries in a particular column that contain a specific set of keywords?
For example, I have the following dataframe:
test <- data.frame(nom = 1:5, name = c("ser bla", "onlybla", "inspectiongfa serdafds", "inspection", "serbla blainspection"))
My keywords of interest are "ser" & "inspection"
What I'm looking for is to list all the values of the second column (i.e. name) in which both keywords are present together.
So basically, my output should list the name values of rows 3 and 5, viz. "inspectiongfa serdafds" & "serbla blainspection"
What I have tried is the following:
I first generate a truth table showing the presence of each keyword for each row in the dataframe as follows:
as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
Once I get this, all I have to do is identify all those row entries where the values are a pair of TRUE TRUE. They correspond to the cases where both keywords of interest are present. Here it's the same rows 3 & 5.
But I'm not able to figure out how to identify the rows with the TRUE TRUE pair, or whether this whole process is overkill and can be done in a much more efficient manner.
Any help would be appreciated. Thanks!
You're almost there :)
Here's a solution extending what you have done:
# store your logic test outcomes
conditions_df <- as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
# False=0 & True=1. Can use rowSums to get the total and find ones that =2 ie True+True
# which gives you the indices of the TRUE outcomes ie the rows we need to filter test
locate_rows <- which(rowSums(conditions_df) == 2)
test$name[locate_rows]
[1] "inspectiongfa serdafds" "serbla blainspection"
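The same filter can also be written without the intermediate data frame, which may scale more comfortably to a longer keyword list. A minimal sketch using the asker's data; fixed = TRUE is my own choice here, so that the keywords are treated as literal strings rather than regular expressions.

```r
test <- data.frame(nom = 1:5,
                   name = c("ser bla", "onlybla", "inspectiongfa serdafds",
                            "inspection", "serbla blainspection"))
keywords <- c("ser", "inspection")

# One logical vector per keyword: does each name contain it?
hits <- lapply(keywords, grepl, x = test$name, fixed = TRUE)

# Combine the vectors element-wise with AND: TRUE only where every keyword is present.
all_present <- Reduce(`&`, hits)

print(test$name[all_present])
```

Swapping `&` for `|` in the Reduce() call would instead keep rows that contain at least one keyword.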

How do I pull the values from multiple columns, conditionally, into a new column?

I am a relatively novice R user, though familiar with dplyr and the tidyverse. I still can't seem to figure out how to pull the actual data from one column into a new column if it meets a certain condition.
Here is what I'm trying to do. Participants have ranked specific practices (n=5) and provided responses to questions that represent their beliefs about these practices. I want to have five new columns that assign their beliefs about the practices to their ranks, rather than the practices.
For example, they have a score for "beliefs about NI" called ni.beliefs, if a participant ranked NI as their first choice, I want the value for ni.beliefs to be pulled into the new column for first.beliefs. The same is true that if a participant put pmii as their first choice practice, their value for pmii.beliefs should be pulled into the first.beliefs column.
So, I need five new columns called: first.beliefs, second.beliefs, third.beliefs, fourth.beliefs, last.beliefs and then I need each of these to have the data pulled in conditionally from the practice specific beliefs (ni.beliefs, dtt.beliefs, pmi.beliefs, sn.beliefs, script.beliefs) dependent on the practice specific ranks (rank assigned of 1-5 for each practice, rank.ni, rank.dtt, rank.pmi, rank.sn, rank.script).
Here is what I have so far but I am stuck and aware that this is not very close. Any help is appreciated!!!
Diss$first.beliefs <- ifelse(rank.ni==1, ni.beliefs,
  ifelse(rank.dtt==1, dtt.beliefs,
    ifelse(rank.pmi==1, pmi.beliefs,
      ifelse(rank.sn, sn.beliefs,
        ifelse(rank.script==1, script.beliefs)))))
Thank you!!
I'm not sure if I understood correctly (it would help if you showed what your data looks like), but this is what I'm thinking:
Without using additional packages, if the ranking columns are equivalent to the index of the new columns you want (i.e. they rank each practice from 1 to 5, without repeats, and in the same order as the new columns "firsts belief, second belief, etc"), then you can use that data as the indices for the second set of columns:
for(j in 1:nrow(people_table)){
people_table[j,]$first.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 1]
people_table[j,]$second.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 2]
...
}
Where
A -> index of the first preference rank column
B -> index of the last preference rank column
(people_table[j,c(A:B)] %in% 1) -> this returns something like (FALSE FALSE TRUE FALSE FALSE)
beliefs -> vector with the names of each belief
That should work. It's simple, no need for packages, and it'll be fast too. Just make sure you've initialized/created the new columns first, otherwise you'll get some errors.
This is done very easily with the case_when() function. You can improve on the code below.
library(dplyr)
Diss <- Diss %>%
  mutate(first.beliefs = case_when(
    rank.ni == 1 ~ ni.beliefs,
    rank.dtt == 1 ~ dtt.beliefs,
    rank.pmi == 1 ~ pmi.beliefs,
    rank.sn == 1 ~ sn.beliefs,
    rank.script == 1 ~ script.beliefs
  ))
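If writing five near-identical case_when() calls feels repetitive, one alternative is to fill all five columns in a short loop with matrix indexing. This is a sketch in base R on made-up data for two participants; it assumes the columns are named as in the question and that every participant uses each rank from 1 to 5 exactly once.

```r
# Toy data: ranks and belief scores are invented for illustration.
Diss <- data.frame(
  rank.ni = c(1, 3), rank.dtt = c(2, 1), rank.pmi = c(3, 2),
  rank.sn = c(4, 5), rank.script = c(5, 4),
  ni.beliefs = c(10, 11), dtt.beliefs = c(20, 21), pmi.beliefs = c(30, 31),
  sn.beliefs = c(40, 41), script.beliefs = c(50, 51)
)

practices <- c("ni", "dtt", "pmi", "sn", "script")
new_cols  <- c("first.beliefs", "second.beliefs", "third.beliefs",
               "fourth.beliefs", "last.beliefs")

# Same practice order in both matrices, so column positions line up.
ranks   <- as.matrix(Diss[paste0("rank.", practices)])
beliefs <- as.matrix(Diss[paste0(practices, ".beliefs")])

for (k in seq_along(new_cols)) {
  # Column position of the practice each participant ranked k-th.
  pos <- max.col((ranks == k) * 1, ties.method = "first")
  # Pick one belief score per row: (row i, column pos[i]).
  Diss[[new_cols[k]]] <- beliefs[cbind(seq_len(nrow(Diss)), pos)]
}
print(Diss[new_cols])
```

If a participant skipped a rank or used one twice, max.col() would still return a position, so it is worth validating the ranks before relying on the result.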

R set column value to be other column value based on string search

I'm trying to find a clean way to get the first column of my DT, for each row, to be equal to the user_id found in other columns. That is, I must perform a search of "user_id" across each row, and return the entirety of the cell where the instance is found.
I first tried to get the index of the column where the partial match is found, and then use this to set the first column's values, but it did not work. Example:
user_id 1 2
1: N/A 300 user_id154
2: N/A user_id301 user_id125040
3: N/A 302 user_id2
For instance, I want to obtain the following
**user_id**
user_id154
user_id301
user_id2
Please bear in mind I am new to this kind of data formatting in R (most of the work I do does not involve cleaning JSON files), and that my data.table has over 1M rows. The answer does not need to be super efficient, but it definitely shouldn't take more than 5 minutes or it will be considered too slow by my boss.
Hopefully it is understandable
I'm sure someone will provide a more elegant solution, but this does the trick:
library(stringr)
dt[, user_id := str_extract(str_c(`1`, `2`), "user_id[0-9]*")]
This first combines all columns row-per-row, then for each row, looks for the first user_id in the combined value.
(Requires the stringr package)
For every row in your table grep first value that has "user_id" in it and put result into column user_id.
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])
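With over 1M rows, a row-wise apply() may be slow, so a vectorised variant of the paste-then-extract idea could be worth trying. This is a minimal base R sketch on the three example rows; the space separator in paste() is my own addition, so that digits from neighbouring cells cannot run together.

```r
# The example table from the question, as a plain data frame.
df <- data.frame(
  user_id = NA_character_,
  `1` = c("300", "user_id301", "302"),
  `2` = c("user_id154", "user_id125040", "user_id2"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Glue the searched columns of each row into one string.
combined <- paste(df[["1"]], df[["2"]])

# regexpr() locates the first match in each string (-1 where there is none).
m <- regexpr("user_id[0-9]+", combined)

# regmatches() returns only the matched pieces, so assign them to the matching rows.
df$user_id[m > 0] <- regmatches(combined, m)
print(df$user_id)
```

Rows without any user_id keep NA in the first column.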

R- Speed up calculation related with subset of data.table

I need help speeding up the case below:
I have roughly 8.5 million rows of order history for 1.3M orders. I need to calculate the time taken between two steps of each order. I use the calculation below:
History[, time_to_next_status:=
get_time_to_next_step(id_sales_order_item_status_history,
id_sales_order_item, History_subset),
by='id_sales_order_item_status_history']
In the code above:
id_sales_order_item - ID of a sales order item; multiple history records can share the same id_sales_order_item
id_sales_order_item_status_history - ID of a history row
History_subset - a subset of History containing only the 3 columns [id_sales_order_item_status_history, id_sales_order_item, created_at] needed in the calculation
created_at - the time the history record was created
The function get_time_to_next_step is defined as below
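get_time_to_next_step <- function(id_sales_order_item_status_history, filter_by,
                                  dataSet){
  # Keep only the history rows that belong to this order item
  dataSet <- dataSet %>% filter(id_sales_order_item == filter_by)
  # Position of the current history row within that subset
  index <- match(id_sales_order_item_status_history, dataSet$id_sales_order_item_status_history)
  # Time until the next status row (NA for the last row of the item)
  time_to_next_status <- dataSet[index + 1, created_at] - dataSet[index, created_at]
  time_to_next_status
}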
The issue is that it takes 15 minutes to run around 10k records of History, so it would take up to ~9 days to complete the calculation. Is there any way I can speed this up without breaking the data into multiple subsets?
I will take a shot. Can't you try something like this..
History[ , Index := 1:.N, by= id_sales_order_item]
History[ , time_to_next_status := created_at[Index+1]-created_at[Index], by= id_sales_order_item]
I would think this would be pretty fast.
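A variant of the same grouped idea uses data.table's shift() to fetch the next timestamp. The sketch below runs on made-up history rows and makes two assumptions of my own: rows are sorted by created_at within each item first, and the gap is stored in seconds so that every group uses the same unit.

```r
library(data.table)

# Toy history: two order items; IDs and timestamps are invented.
History <- data.table(
  id_sales_order_item_status_history = 1:5,
  id_sales_order_item = c(10L, 10L, 10L, 20L, 20L),
  created_at = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:05:00",
                            "2020-01-01 12:05:00", "2020-01-02 09:00:00",
                            "2020-01-02 09:00:30"), tz = "UTC")
)

# Sort so that "next row" means "next status" within each order item.
setorder(History, id_sales_order_item, created_at)

# Lead the timestamp by one row per item and take the difference in seconds.
# The last row of each item has no next status, so it gets NA.
History[, time_to_next_status := as.numeric(
  difftime(shift(created_at, type = "lead"), created_at, units = "secs")
), by = id_sales_order_item]
print(History)
```

Because the work is done once per order item rather than once per history row, there is no repeated filtering of the full table.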
