Improve efficiency of lookup algorithm in R

I think this is an interesting task: optimizing a piece of R code.
I have a dataframe df_red which contains details of orders from a webshop. For each product (ean), I want to get the 12 other products most likely to be in a basket with it.
This is the sample code to generate such data set:
library(tidyverse)
# create a vector with 1400 products (characterized by their EANs)
eans <- sample(1e5:1e6, 1400, replace = FALSE)
# create a vector with 200k orders
basket_nr <- 1:2e5
# a basket can have up to 4 items; it is most likely to have 3 items
n_prod_per_basket <- sample(x = 1:4, length(basket_nr), prob = c(0.2, 0.2, 0.5, 0.1), replace = TRUE)
# create df_red, each line of which corresponds to a product with its respective basket number
df <- data_frame(basket_nr, n_prod_per_basket)
df_red <- data_frame(basket_nr = rep(basket_nr, n_prod_per_basket))
df_red$ean <- sample(x = eans, nrow(df_red), replace = TRUE)
The code I am using to accomplish this task is the following, but I am sure it is not an efficient one. How can I increase the speed of the program?
ean <- unique(df_red$ean)
out <- list()
for (i in 1:length(ean)){
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if (i %% 100 == 0) print(paste0(round(i / length(ean) * 100, 2), '% is complete'))
}

Performance improvements are of course a matter of degree, and how far to go before it is improved "enough" is hard to say. However, we can reduce run time by about 25% by functionalizing your code and cleaning up the subsetting logic. Starting with your code:
#added a timer
start.time <- Sys.time()
for (i in 1:length(ean)){
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if (i %% 100 == 0) print(paste0(round(i / length(ean) * 100, 2), '% is complete'))
}
Sys.time() - start.time
This takes between 30 and 34 seconds on my machine. However, we can rewrite it as a function like so:
my.top12.func <- function(id, df_red) {
  # improved subsetting logic - using which is faster and we can remove some code by
  # removing the ean that is being iterated in the filter step below
  prod <- df_red[df_red$basket_nr %in% df_red$basket_nr[which(df_red$ean == id)], ]
  # set cutoff from 12 to 13 since the specific ean will always be one of the top 12
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:13 & ean != id)  # additional filter required
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) return(data_frame(ean = id, recom = top12$ean, freq = top12$n))
}
Now we can test the speed and accuracy of this approach by doing:
start.time <- Sys.time()
my.out <- lapply(ean, my.top12.func, df_red = df_red)
Sys.time() - start.time
#test for equality
all.equal(out, my.out)
Which is about 24-26 seconds for a 25%+ improvement.
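If you prefer a single data frame over a list, the non-NULL results can be combined in one step; a small usage sketch (my.out.df is just an illustrative name):
# NULL entries (products without 12 co-purchased partners) are dropped by bind_rows()
my.out.df <- dplyr::bind_rows(my.out)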

Playing around with data.table, I can produce the output in under 7 seconds (which I guess is about an 80% improvement):
library(data.table)
setDT(df_red)
all_eans <- df_red[, unique(ean)]
k <- lapply(all_eans, function(x) {
  df_red[basket_nr %in% df_red[ean == x, unique(basket_nr)],
         .N,
         by = ean][order(-N)][2:13]
})
names(k) <- all_eans
k <- k[sapply(k, nrow) == 12]
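If a single table is more convenient than the list k, the filtered results can be stacked, carrying the list names along as the product each block belongs to. This is only a usage sketch; for_ean and recommendations are names introduced here for illustration:
# Stack the per-product results; for_ean holds the product each block of 12 belongs to,
# while ean and N are the recommended partner product and its co-occurrence count
recommendations <- rbindlist(k, idcol = "for_ean")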

I would consider not using a loop.
df_red$k <- 1
df_s <- left_join(df_red, df_red, by = "k") %>%
  filter(ean.x != ean.y & basket_nr.x == basket_nr.y) %>%
  group_by(ean.x, ean.y) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(row_number() %in% 1:13)
df_s.ct <- df_s %>% filter(row_number() == 12)
df_s.fin <- df_s[df_s$ean.x %in% df_s.ct$ean.x, ]
The rate-limiting step here is the left_join, which merges the dataset with itself and creates a quadratically larger dataset (with 50,000 rows you would end up with a new dataset of 2.5 billion rows). This suggests that the best way to store and manipulate the data is data.table, which will increase the speed of this procedure, especially when combined with dplyr.
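If you do go down the join route, a data.table self-join on basket_nr avoids the full cross product, because only rows that share a basket are ever paired, so the pair table stays at roughly (basket size)^2 rows per basket rather than n^2 overall. This is a sketch rather than a drop-in replacement; co, partner and top12 are names introduced here for illustration, and i.ean is data.table's prefix for the joined copy's column:
library(data.table)
dt <- as.data.table(df_red)
# Self-join on basket_nr: every pair of rows that share a basket
co <- dt[dt, on = "basket_nr", allow.cartesian = TRUE][ean != i.ean]
# Count co-occurrences and keep the 12 most frequent partners per product
top12 <- co[, .N, by = .(ean, partner = i.ean)][order(-N), head(.SD, 12), by = ean]
# Keep only products that have at least 12 distinct partners
top12 <- top12[, if (.N == 12) .SD, by = ean]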

Related

Is there a way to isolate multiple index points in a dataset and isolate a time window around that index point in R?

I have a dataset with patients who were under observation for 72 hours. Patients' oxygen levels were measured every 4 seconds, but some observations had to be removed due to issues with the accuracy of the data. As a result, patients have a varying number of observations.
While patients were observed, they underwent various interventions. The objective of my analysis is to assess whether interventions affected the patient's oxygen levels or not. Therefore, I am doing a comparison of the oxygen levels pre-intervention, during intervention and post-intervention.
While the actual analysis isn't too difficult, I am having a hard time subsetting the data. For example, I only want observations 300 seconds prior to the start of an intervention and 300 seconds post intervention. I have to take into account the fact that an individual may have multiple interventions over the course of the time window and there are multiple subjects.
I have provided some sample code below to generate a dataset but please let me know if I'm missing anything.
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
                  rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
                  rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
  group_by(id) %>%
  mutate(time = row_number() * 4)
So far, I have tried the method below, but I was only able to isolate single observations 5 minutes pre and post an intervention, not all of the observations between those time points (i.e., I get the single observation 5 minutes prior to the start of an intervention and the single observation 5 minutes post intervention, but not all the observations in between).
data <- final %>%
  filter(intervention == 1) %>%
  mutate(five_mins_after = time + 300,     # 5 mins after intervention
         five_mins_before = time - 300) %>%  # 5 mins before intervention
  filter(id == "1")
data2 <- final %>%
  filter(intervention == 0,
         id == "1")
data_after <- data %>%
  dplyr::select(five_mins_after)
data_before <- data %>%
  dplyr::select(five_mins_before)
data3 <- merge(data2, data_after, by.x = "time", by.y = "five_mins_after")
data4 <- merge(data2, data_before, by.x = "time", by.y = "five_mins_before")
final <- final %>%
  dplyr::bind_rows(data3) %>%
  dplyr::bind_rows(data4)
Please let me know if you need any additional information and thanks for your time!
PS: Apologies if I missed anything, first time asking for help here
Here is the answer. Although long, it works fine for gathering observations from 300 seconds before to 300 seconds after the beginning of an intervention.
Let me know if you'd like further explanation or if I have misunderstood anything.
library(magrittr)
library(tidyverse)
### Sample code
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
                  rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
                  rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
  group_by(id) %>%
  mutate(time = row_number() * 4)
### Start of data processing to get wanted observations
# Order it by id and time
final %<>% arrange(id, time)
# Loop over the unique ids
obs_to_keep <- list()
for (i in unique(final$id)) {
  # Get the times with no intervention
  time_zero_intervention <- final %>%
    filter(id == i & intervention == 0) %>%
    select(time)
  # Obtain all times right after a zero intervention, which could be intervention == 1
  time_plus_4 <- time_zero_intervention$time + 4
  # Keep the times after a 0 intervention where there is a 1 intervention
  starts_of_interventions <- final %>%
    filter(id == i & time %in% time_plus_4) %>%
    filter(intervention == 1)
  # Loop over each one of the times where an intervention starts
  all_times <- list()
  for (n in 1:length(starts_of_interventions$time)) {
    # Gather 300 secs prior and post
    time_300_before <- starts_of_interventions$time[n] - 300
    time_300_after <- starts_of_interventions$time[n] + 300
    # Filter for observations in this interval
    all_times[[n]] <- final %>%
      filter(id == i) %>%
      filter(time >= time_300_before & time <= time_300_after)
  }
  if (length(all_times) == 1) {
    obs_to_keep[[i]] <- as.data.frame(all_times)
  } else {
    obs_to_keep[[i]] <- do.call(rbind, all_times)
  }
}
# Make a data frame from the list
df <- do.call(rbind, obs_to_keep)
# Order it by id and time
df %<>% arrange(id, time)
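A more compact alternative, sketched here under the assumption that an intervention start is simply a 1 preceded by a 0 (the loop above derives the starts slightly differently, via the times 4 seconds after a zero-intervention observation, so edge cases at the very first observation may differ): flag the starts, join them back by id, and keep rows within 300 seconds of any start. start_time and obs_windows are names introduced for illustration.
library(dplyr)
# One row per intervention start, per patient
starts <- final %>%
  group_by(id) %>%
  filter(intervention == 1, lag(intervention, default = 0) == 0) %>%
  ungroup() %>%
  select(id, start_time = time)
# Pair every observation with every start for the same id, keep the windows,
# and drop duplicates caused by overlapping windows
obs_windows <- final %>%
  inner_join(starts, by = "id") %>%
  filter(time >= start_time - 300,
         time <= start_time + 300) %>%
  distinct(id, time, .keep_all = TRUE) %>%
  select(-start_time) %>%
  arrange(id, time)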

Why is my R programming 'For' loop so slow?

I'm trying to wrangle USR files (around 7,000 of them) into a long data format.
I've created the code below, but it takes over 2 hours to run (hence the progress printer).
Does anyone have any idea how I can speed up this code? Are there specific lines that are slowing it down?
Thanks in advance!
for (i in D_flows) {
  flow <- read.table(i, header = F, fill = T, sep = "|")
  for (j in flow) {
    Flow_name <- i
    Timestamp <- ymd_hms(flow[flow$V1 == "ZHV", 8])
    Date <- ymd(flow[flow$V1 == "ZPD", 2])
    SR <- as.vector(flow[flow$V1 == "ZPD", 3])
    SP <- as.integer(as.vector(flow[flow$V1 == "SE1", 2]))
    EV <- as.numeric(as.character(flow[flow$V1 == "SE1", 4]))
    Flow_data <- tibble(Flow_name, Timestamp, Date, SR, SP, EV)
    Flow_data <- Flow_data[complete.cases(Flow_data), ]
    Flow_data <- Flow_data %>%
      group_by(SP) %>%
      mutate(MEV = sum(EV)) %>%
      select(Flow_name, Timestamp, Date, SR, SP, MEV) %>%
      unique() %>%
      ungroup()
  }
  # Append the flow data to the D flow data file
  D_flow_data <- bind_rows(D_flow_data, Flow_data)
  # Shows the progress of the for loop
  progress <- D_flow_data %>%
    select(-Timestamp, -Date, -SR, -SP, -MEV) %>%
    unique()
  print(nrow(progress))
}
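Two things stand out: the inner for (j in flow) loop never uses j, so the same per-file work is redone once per column of flow, and D_flow_data is grown with bind_rows() on every iteration, which re-copies everything accumulated so far. I can't test this without the USR files, so treat the following as a sketch under the same assumptions as your loop (one ZHV row and one ZPD row per file): parse each file once in a function, then bind all results in a single step.
library(dplyr)
library(lubridate)
library(purrr)
read_one_flow <- function(path) {
  flow <- read.table(path, header = FALSE, fill = TRUE, sep = "|")
  Flow_data <- tibble(
    Flow_name = path,
    Timestamp = ymd_hms(flow[flow$V1 == "ZHV", 8]),
    Date      = ymd(flow[flow$V1 == "ZPD", 2]),
    SR        = as.vector(flow[flow$V1 == "ZPD", 3]),
    SP        = as.integer(as.vector(flow[flow$V1 == "SE1", 2])),
    EV        = as.numeric(as.character(flow[flow$V1 == "SE1", 4]))
  )
  Flow_data <- Flow_data[complete.cases(Flow_data), ]
  Flow_data %>%
    group_by(SP) %>%
    mutate(MEV = sum(EV)) %>%
    select(Flow_name, Timestamp, Date, SR, SP, MEV) %>%
    unique() %>%
    ungroup()
}
# One bind at the end instead of growing D_flow_data inside the loop
D_flow_data <- map_dfr(D_flows, read_one_flow)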

Can I omit search results from a dataset in R?

My first work in databases was in FileMaker Pro. One of the features I really liked was the ability to do a complex search, and then with one call, omit those results and return anything from the original dataset that wasn't returned in the search. Is there a way to do this in R without having to flip all the logic in a search?
Something like:
everything_except <- df %>%
filter(x == "something complex") %>%
omit()
My initial thought was looking into using a join to keep non-matching values, but thought I would see if there's a different way.
Update with example:
I'm a little hesitant to add an example because I don't want a solution for just this one problem; I'd like to understand whether there is an underlying method that covers multiple cases.
set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>%
  sample_n(100) %>%
  mutate(period = sample(c(1,2,3),
                         size = 100,
                         replace = TRUE),
         event = sample(c("A","B"),
                        size = 100,
                        replace = TRUE,
                        prob = c(0.1,0.9))) %>%
  select(period, time_sec, event) %>%
  arrange(period, time_sec)

filter_within_timeframe <- function(.data, condition, time, lead_time = 0, lag_time = 0) {
  condition <- enquo(condition)
  time <- enquo(time)
  filtered <- .data %>%
    slice(., 1:max(which(!!condition))) %>%
    group_by(., grp = lag(cumsum(!!condition), default = 0)) %>%
    filter(., (last(!!time) - !!time) <= lead_time & (last(!!time) - !!time) >= lag_time)
  return(filtered)
}

# this returns 23 rows of data. I would like to return everything except this data
event_df %>% filter_within_timeframe(event == "A", time_sec, 10, 0)

# final output should be 77 rows starting with...
# ~period, ~time_sec, ~event,
# 1,3,"B",
# 1,4,"B",
# 1,5,"B",

Using apply to replace nested for loop

My goal is to go through various signals and ignore any 1's that are not part of a series (a minimum of at least two 1's in a row). The data is an xts time series with 180K+ columns and 84 months. I've provided a small, simplified data set below. I've used a nested for loop, but it's taking way too long to finish on the entire data set; it works, but it's horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
library(xts)
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over Signals to change 1's to 0's for one month events
for (col in 1:ncol(mod_sig)) {
  for (row in 1:nrow(mod_sig)) {
    val <- ifelse(mod_sig[row, col] == 1,
                  ifelse(mod_sig[row - 1, col] == 0,
                         ifelse(mod_sig[row + 1, col] == 0, 0, 1), 1), 0)
    Signals_Fin[row, col] <- val
  }
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr; I only made slight modifications based on what was given and some help from a colleague.
Answer code:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
Signals_fin <- mod_sig %>%
  mutate_all(funs(ifelse((. == 1 & (lag(.) == 1 | lead(.) == 1)), 1, 0))) %>%
  mutate_all(funs(ifelse(is.na(.), 0, .)))
Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Here's a stab from a dplyr perspective. I converted your row names to a column, but you can just as easily convert them back to row names with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), function(x) {
    if_else(x == 1 &
              (lag(x, order_by = .$months) == 1 |
                 lead(x, order_by = .$months) == 1),
            1,
            0)
  })
As suggested by @Ryan, his mutate_at call is more elegant; it's important that everything is already sorted, though:
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
  as.data.frame() %>%
  mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
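Since the real data is an xts object with 180K+ columns, it may also be worth working directly on the underlying matrix and wrapping the result back into xts; a minimal sketch of the same lag/lead idea, assuming mod_sig is the xts object built in the question:
library(xts)
m <- coredata(mod_sig)
lag_m  <- rbind(matrix(0, 1, ncol(m)), head(m, -1))  # previous month, zero-padded
lead_m <- rbind(tail(m, -1), matrix(0, 1, ncol(m)))  # next month, zero-padded
# A 1 survives only if the previous or the next month is also a 1
cleaned <- (m == 1) & (lag_m == 1 | lead_m == 1)
Signals_fin <- xts(cleaned * 1, order.by = index(mod_sig))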

How to automatically look up the start point of the largest break in a given dataset in R

I am trying to find the starting point of the largest break in a given dataset. Here is my example:
data <- data.frame(month = c(1:12), countx = c(60,69,10,13,65,80,59,84,43,21,18,10))
select <- data[data$countx >= 50, ] # take values of at least 50 into account
# find the breaks
wtym <- select$month
breaks <- c(0, which(diff(wtym) != 1), length(wtym))
allbreak <- sapply(seq(length(breaks) - 1),
                   function(i) wtym[(breaks[i] + 1):breaks[i+1]])
> allbreak
[[1]]
[1] 1 2
[[2]]
[1] 5 6 7 8
The question is: I need to do this for a large number of datasets (and the breaks obviously vary), so is there any way to automatically pick up the start point of the largest break in a series (in this example, it is month 5, i.e., the second gap)? Any idea is highly appreciated. Thanks!
Sounds like a run-length-encoding ?rle task where you are looking for runs of x < 50 and x >= 50. Here's a function:
bigbreak <- function(x, cutoff) {
  r <- rle(x >= cutoff)
  cumsum(r$l)[which(r$l == max(r$l[r$v]) & r$v) - 1] + 1
}
bigbreak(data$countx, 50)
#[1] 5
Now let's try it on 5 million records:
set.seed(1)
x <- sample(c(50,0), 5e6, replace=TRUE)
system.time({
  bigbreak(x, 50)
})
#   user  system elapsed
#   0.41    0.00    0.41
Under half a second, not too bad.
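Since you mention needing this for a large number of datasets, the same function can be applied over all of them in one call; a usage sketch with a hypothetical named list of data frames, each with month and countx columns:
datasets <- list(jan = data, feb = data)
# bigbreak() returns a row index, so map it back to the month column
sapply(datasets, function(d) d$month[bigbreak(d$countx, 50)])
# jan feb
#   5   5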
A solution using dplyr and data.table.
# Create example data frame
data <- data.frame(month = c(1:12), countx = c(60,69,10,13,65,80,59,84,43,21,18,10))
# Load packages
library(dplyr)
library(data.table)
# Process the data
data2 <- data %>%
  mutate(Condition = countx >= 50) %>%
  mutate(RunID = rleid(Condition)) %>%
  filter(Condition) %>%
  group_by(RunID) %>%
  mutate(num = n()) %>%
  ungroup() %>%
  filter(num == max(num))
# Show the number of the first month
data2$month[1]
# [1] 5
