Using apply to replace a nested for loop in R
My goal is to go through various signals and ignore any 1's that are not part of a series (a minimum of two 1's in a row). The data is an xts time series with 180K+ columns and 84 months; I've provided a small simplified data set below. I've used a nested for loop, but it's taking way too long to finish on the entire data set. It works, but it's horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
library(xts)
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where the condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over signals to change 1's to 0's for one-month events
for(col in 1:ncol(mod_sig)) {
  for(row in 1:nrow(mod_sig)) {
    val <- ifelse(mod_sig[row, col] == 1,
                  ifelse(mod_sig[row - 1, col] == 0,
                         ifelse(mod_sig[row + 1, col] == 0, 0, 1), 1), 0)
    Signals_Fin[row, col] <- val
  }
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr; I only made slight modifications based on what was given, with some help from a colleague.
Answer code:
library(dplyr)
library(xts)
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
Signals_fin <- mod_sig %>%
  mutate_all(funs(ifelse((. == 1 & (lag(.) == 1 | lead(.) == 1)), 1, 0))) %>%
  mutate_all(funs(ifelse(is.na(.), 0, .)))
Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Here's a stab from a dplyr perspective. I converted your row names to a column, but you can just as easily convert them back to row names with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), function(x) {
    if_else(x == 1 &
              (lag(x, order_by = .$months) == 1 |
                 lead(x, order_by = .$months) == 1),
            1,
            0)
  })
As suggested by @Ryan, his mutate_at call is more elegant; it's important that everything is already sorted, though:
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
  as.data.frame() %>%
  mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
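Side note for the 180K+ column case: the same rule can also be applied to the whole xts object in one vectorized step on its underlying matrix, avoiding a per-column pass. This is only a sketch (not from the answers above), assuming the object holds plain 0/1 values:
library(xts)
# keep a 1 only if the month before or after is also 1
m   <- coredata(mod_sig)                       # underlying numeric matrix
nxt <- rbind(m[-1, , drop = FALSE], 0)         # value of the following month
prv <- rbind(0, m[-nrow(m), , drop = FALSE])   # value of the preceding month
keep <- m * ((nxt == 1) | (prv == 1))          # drop isolated 1's
Signals_fin_mat <- xts(keep, order.by = index(mod_sig))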
Related
Median Split of one variable to create another variable
I am currently struggling with a median split in RStudio. I wish to create a new column in my data frame which is a median split of another; however, I do not know how this can be accomplished. Any and all help will be appreciated. This is the code I have previously run:
medianpcr <- median(honourswork$PCR.x)
highmedian <- filter(honourswork, PCR.x <= medianpcr)
lowmedian <- filter(honourswork, PCR.x > medianpcr)
When you post a question on SO, it's always a good idea to include an example dataframe so that the answerer doesn't have to create one themselves. On to your question: if I understand you correctly, you can use mutate() and case_when() from the dplyr package:
# Load the dplyr library
library(dplyr)
# Create an example dataframe
data <- data.frame(
  rowID = c(1:20),
  value = runif(20, 0, 50)
)
# Use case_when to mutate a new column 'category' with values based on
# the 'value' column
data2 <- data %>%
  dplyr::mutate(category = dplyr::case_when(
    value > median(value) ~ "Highmedian",
    value < median(value) ~ "Lowmedian",
    value == median(value) ~ "Median"
  ))
More about case_when() here. Hope this helps!
Let's first create some data:
set.seed(123)
honourswork <- data.frame(PCR.x = rnorm(100))
In dplyr, you might do:
library(tidyverse)
honourswork %>%
  mutate(medianpcr = median(PCR.x)) %>%
  mutate(highmedian = ifelse(PCR.x > medianpcr, 1, 0)) -> honourswork
honourswork %>%
  mutate(medianpcr = median(PCR.x)) %>%
  mutate(lowmedian = ifelse(PCR.x <= medianpcr, 1, 0)) -> honourswork
Equivalently in base R:
honourswork$highmedian <- 0
honourswork$highmedian[honourswork$PCR.x > median(honourswork$PCR.x)] <- 1
honourswork$lowmedian <- 0
honourswork$lowmedian[honourswork$PCR.x <= median(honourswork$PCR.x)] <- 1
Is there a faster way than applying 'ddply' to aggregate columns by groups with a large dataset?
Purpose
I am trying to check whether a pair of values in two columns appears in the previous event, and to aggregate the resulting dummy variable by groups. Specifically, I have an event id (oid) and dyad-level observations associated with each event: agent (aid) and partner (pid). The events are sorted by the time at which the event occurs (o4.in). (1) I made a dummy variable indicating whether a pair of agent and partner appears together in the previous event. (2) I then used ddply to aggregate the dummy variable by groups, as specified in the example below. I find that ddply and the lag functions take a long time with a large dataset, and I am wondering if there is a faster way to achieve these tasks.
Dataset
library(tidyverse)
library(tibble)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10001)
cases <- sample(1:5, 1000, replace = T)
set.seed(10002)
agent <- sample(1:20, 1000, replace = T)
set.seed(10003)
partner <- sample(1:20, 1000, replace = T)
set.seed(123)
n <- 1000 # no of random datetimes needed
minDate <- as.POSIXct("1999/01/01")
maxDate <- as.POSIXct("2000-01-01")
epoch <- "1970-01-01"
timestamps <- as.POSIXct(pmax(runif(n, minDate, maxDate), runif(n, minDate, maxDate)), origin = epoch)
df <- data.frame(cases, agent, partner, timestamps) %>%
  rename(
    aid = agent,
    pid = partner,
    oid = cases,
    o4.in = timestamps
  ) %>%
  filter(aid != pid)
Current Methods
# creating the dummy variable
d <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(oid.lag.a = lag(oid)) %>%
  ungroup %>%
  group_by(pid) %>%
  mutate(oid.lag.p = lag(oid)) %>%
  ungroup %>%
  mutate(
    j2.consecutive = ifelse(oid.lag.a == oid.lag.p, 1, 0),
    j2.consecutive = ifelse(is.na(j2.consecutive), 0, j2.consecutive)
  ) %>%
  select(-oid.lag.a, -oid.lag.p)
# aggregating the dummy variable by groups
t <- d %>%
  ungroup %>%
  ddply(c('oid', 'aid'), function(i){
    i %>% mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
  }, .progress = 'text') %>%
  arrange(oid, pid) %>%
  ddply(c('oid', 'pid'), function(i){
    i %>% mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n()-1))
  }, .progress = 'text')
Update for Future Readers
Task (1) is achieved by the answer from @akrun below. Task (2) is answered by @akrun in a separate post: A faster way than applying 'ddply' to aggregate a variable by a function by groups. Special thanks to @akrun!!
We can use data.table methods to make it faster:
library(data.table)
df2 <- copy(df)
df3 <- setDT(df2)[order(o4.in)]
df3[, oid.lag.a := shift(oid), by = aid
   ][, oid.lag.p := shift(oid), by = pid]
df3[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)]
Also, note that some things in the OP's code are unnecessary, i.e. using ifelse to convert a logical to binary. It can just be as.integer or coercion with +. The second ifelse line can be removed as well with coalesce:
library(dplyr)
out <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(oid.lag.a = lag(oid)) %>%
  group_by(pid) %>%
  mutate(oid.lag.p = lag(oid)) %>%
  ungroup %>%
  mutate(j2.consecutive = coalesce(+(oid.lag.a == oid.lag.p), 0))
Checking the output from dplyr/data.table:
all(out$j2.consecutive == df3$j2.consecutive)
[1] TRUE
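The aggregation step (task (2)) is answered in the separate post linked above. Purely as a hedged sketch of the same idea, and not code taken from that post, the two ddply passes could be replaced with grouped := updates on the same data.table:
# Hypothetical sketch: mirrors the ddply logic with data.table grouped updates.
# As in the OP's code, groups of size 1 produce 0/0 (NaN).
df3[, aj1.consecutive  := (sum(j2.consecutive) - j2.consecutive) / (.N - 1), by = .(oid, aid)]
df3[, apj1.consecutive := (sum(j2.consecutive) - j2.consecutive) / (.N - 1), by = .(oid, pid)]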
Can I omit search results from a dataset in R?
My first work in databases was in FileMaker Pro. One of the features I really liked was the ability to do a complex search, and then with one call omit those results and return anything from the original dataset that wasn't returned in the search. Is there a way to do this in R without having to flip all the logic in the search? Something like:
everything_except <- df %>%
  filter(x == "something complex") %>%
  omit()
My initial thought was looking into using a join to keep non-matching values, but I thought I would see if there's a different way.
Update with example:
I'm a little hesitant to add an example because I don't want to solve for just this problem, but I understand if there is an underlying method for multiple cases.
set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>%
  sample_n(100) %>%
  mutate(period = sample(c(1,2,3), size = 100, replace = TRUE),
         event = sample(c("A","B"), size = 100, replace = TRUE, prob = c(0.1,0.9))) %>%
  select(period, time_sec, event) %>%
  arrange(period, time_sec)
filter_within_timeframe <- function(.data, condition, time, lead_time = 0, lag_time = 0) {
  condition <- enquo(condition)
  time <- enquo(time)
  filtered <- .data %>%
    slice(., 1:max(which(!!condition))) %>%
    group_by(., grp = lag(cumsum(!!condition), default = 0)) %>%
    filter(., (last(!!time) - !!time) <= lead_time & (last(!!time) - !!time) >= lag_time)
  return(filtered)
}
# this returns 23 rows of data. I would like to return everything except this data
event_df %>% filter_within_timeframe(event == "A", time_sec, 10, 0)
# final output should be 77 rows starting with...
# ~period, ~time_sec, ~event,
# 1,3,"B",
# 1,4,"B",
# 1,5,"B",
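As a hedged sketch of the join idea mentioned in the question (my own illustration, not part of the original post), dplyr::anti_join() returns the rows of one table that have no match in another, which gives the "everything except" set, assuming period, time_sec and event together identify rows uniquely:
library(dplyr)
# rows matched by the complex filter (the 23 rows in the example)
matched <- event_df %>%
  filter_within_timeframe(event == "A", time_sec, 10, 0) %>%
  ungroup()
# everything from the original data that is not in 'matched' (the other 77 rows)
everything_except <- anti_join(event_df, matched,
                               by = c("period", "time_sec", "event"))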
Improve efficiency of lookup algorithm in R
I think this is an interesting task to optimize a piece of R code. I have a dataframe df_red which holds order details from a webshop. For each product (ean), I want to get the 12 other products most likely to be in a basket with it. This is the sample code to generate such a data set:
library(tidyverse)
# create a vector with 1400 products (characterized by their EANs)
eans <- sample(1e5:1e6, 1400, replace = FALSE)
# create a vector with 200k orders
basket_nr <- 1:2e5
# a basket can have up to 4 items; it's most likely to have 3 items
n_prod_per_basket <- sample(x = 1:4, length(basket_nr), prob = c(0.2, 0.2, 0.5, 0.1), replace = TRUE)
# create df_red, each line of which corresponds to a product with its respective basket number
df <- data_frame(basket_nr, n_prod_per_basket)
df_red <- data_frame(basket_nr = rep(basket_nr, n_prod_per_basket))
df_red$ean <- sample(x = eans, nrow(df_red), replace = TRUE)
The code I am using to accomplish this task is the following, but I am sure it's not an efficient one. How can I increase the speed of the program?
ean <- unique(df_red$ean)
out <- list()
for (i in 1:length(ean)){
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if(nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if(i %% 100 == 0) print(paste0(round(i/length(ean)*100, 2), '% is complete'))
}
Performance improvements are of course a matter of degree. How far to go before it is improved "enough" is hard to say. However, we can reduce run time by about 25% by functionalizing your code and cleaning up the subsetting logic. Starting with your code:
# added a timer
start.time <- Sys.time()
for (i in 1:length(ean)){
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if(nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if(i %% 100 == 0) print(paste0(round(i/length(ean)*100, 2), '% is complete'))
}
Sys.time() - start.time
This takes between 30 and 34 seconds on my machine. However, we can rewrite it as a function like so:
my.top12.func <- function(id, df_red) {
  # improved subsetting logic - using which is faster, and we can remove some code by
  # removing the ean that is being iterated in the filter step below
  prod <- df_red[df_red$basket_nr %in% df_red$basket_nr[which(df_red$ean == id)], ]
  # set cutoff from 12 to 13 since the specific ean will always be one of the top 12
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:13 & ean != id) # additional filter required
  # skip products that weren't together in a basket with at least 12 different other products
  if(nrow(top12) == 12) return(data_frame(ean = id, recom = top12$ean, freq = top12$n))
}
Now we can test the speed and accuracy of this approach by doing:
start.time <- Sys.time()
my.out <- lapply(ean, my.top12.func, df_red = df_red)
Sys.time() - start.time
# test for equality
all.equal(out, my.out)
Which runs in about 24-26 seconds, a 25%+ improvement.
Playing around with data.table, I have the output produced in under 7 seconds (which I guess is about 80% improvement):
library(data.table)
setDT(df_red)
all_eans <- df_red[, unique(ean)]
k <- lapply(all_eans, function(x) {
  df_red[basket_nr %in% df_red[ean == x, unique(basket_nr)], .N, by = ean][order(-N)][2:13]
})
names(k) <- all_eans
k <- k[sapply(k, nrow) == 12]
I would consider not using a loop.
df_red$k <- 1
df_s <- left_join(df_red, df_red, by = "k") %>%
  filter(ean.x != ean.y & basket_nr.x == basket_nr.y) %>%
  group_by(ean.x) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(row_number() %in% 1:13)
df_s.ct <- df_s %>% filter(row_number() == 12)
df_s.fin <- df_s[df_s$ean.x %in% df_s.ct$ean.x, ]
The rate-limiting step here is the left_join, which merges the dataset to itself and creates an exponentially larger dataset (so if you have 50,000 points, you will end up creating a new dataset that is 2.5B points). This indicates that the best way to store and manipulate the data is with data.table, which will increase the speed of this procedure, especially when combined with dplyr.
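As a hedged sketch of that data.table suggestion (my own illustration, not code from the answer above), the same self-join-and-count idea can be expressed with a join on basket_nr:
library(data.table)
dt <- as.data.table(df_red)
# join the table to itself on basket_nr and drop self-pairs;
# allow.cartesian is needed because each basket contributes every pair of its items
pairs <- dt[dt, on = "basket_nr", allow.cartesian = TRUE][ean != i.ean]
# count co-occurrences and keep the 12 most frequent partners per product
top12 <- pairs[, .N, by = .(ean, i.ean)][order(-N), head(.SD, 12), by = ean]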
Calling recursive functions in R
Assuming I have a dataframe df with this info:
group wk source revenue
1     1  C      100
1     1  D      200
1     1  A      300
1     1  B      400
1     2  C      500
1     2  D      600
I'm trying to programmatically filter down to rows of unique combinations of group, wk and source, and then perform some operations on them, before combining them back into another dataframe. I want to write a function that can scale to any number of segments (and not just the example scenario here) and filter down rows. All I need to pass would be the column names by which I want to segment, e.g.
seg <- c("group", "wk", "source")
One unique combination to filter rows in df would be:
df %>% filter(group == 1 & wk == 1 & source == "A")
I wrote a recursive function (get_rows) to do so, but it doesn't seem to do what I want. Could anyone provide inputs on where I'm going wrong?
library(dplyr)
filter_row <- function(df, x) {
  df %>% filter(group == x$group & wk == x$wk & source == x$source)
}
seg <- c("group", "wk", "source")
get_rows <- function(df, seg, pos = 1, l = list()) {
  while(pos <= (length(seg) + 1)) {
    if(pos <= length(seg))
      for(j in 1:length(unique(df[, seg[pos]]))) {
        k <- unique(df[, seg[pos]])
        l[seg[pos]] <- k[j]
        get_rows(df, seg, pos + 1, l)
        return()
      }
    if(pos > length(seg)) {
      tmp <- df %>% filter_row(l)
      <call some function on tmp>
      return()
    }
  }
}
get_rows(df, seg)
EDIT: I understand there are prebuilt methods I can use to get what I need, but I'm curious about where I'm going wrong in the recursive function I wrote.
There might be a data.table/dplyr solution out there, but this one is pretty simple.
# Just paste together the values of the columns you want to aggregate over.
# This creates a vector of factors
f <- function(data, v) {apply(data[, v, drop = F], 1, paste, collapse = ".")}
# Aggregate, tapply, ave, and a few more functions can do the same thing
by(data = df,                                    # Your data here
   INDICES = f(df, c("group", "wk", "source")),  # Your data and columns here
   FUN = identity, simplify = F)                 # Your function here
Can also use library(dplyr) and library(data.table):
df %>% data.table %>% group_by(group, wk, source) %>% do(yourfunctionhere, use . for x)
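As a further hedged sketch (my own illustration, not part of the answer above), the non-recursive version of "one piece per unique combination of seg" can also be written with base split() or dplyr::group_split(), which avoids both the loop and the recursion:
library(dplyr)
seg <- c("group", "wk", "source")
# base R: one data frame per unique combination of the segment columns
pieces <- split(df, df[seg], drop = TRUE)
# dplyr equivalent, returned as a list of tibbles
pieces2 <- df %>% group_by(across(all_of(seg))) %>% group_split()
# apply whatever per-segment function is needed, e.g.
results <- lapply(pieces, function(tmp) tmp)  # replace identity with the real operation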