I have two data frames (table1 and randomdata) with the following schema:
#randomdata
randomdata$cube = {1,5,3,3,4,5,5,2,2,6,1,2,....} (1000 rows)
#table1
table1$side = {1,2,3,4,5,6} (6 rows)
table1$frequency = NULL
I want to count the occurrences of the different sides of the cube (in the first 10 rows of randomdata$cube) and assign the result to the corresponding row of table1$frequency (based on table1$side).
I can do this successfully this way:
table1$frequency[1] <- sum(randomdata$cube[1:10] == 1)
table1$frequency[2] <- sum(randomdata$cube[1:10] == 2)
table1$frequency[3] <- sum(randomdata$cube[1:10] == 3)
...
table1$frequency[6] <- sum(randomdata$cube[1:10] == 6)
This works very well, but there must be a better way.
Instead of 6 statements, I imagine something like this:
table1$frequency <- sum(randomdata$cube[1:10] == table1$side)
Can someone show me a more dynamic way to do this?
Thank you.
We can do this by converting the 'cube' column to a factor with levels specified as 1:6 and then calling table. Without that, missing elements would get dropped from the table output; with the factor levels, a missing side shows up as 0.
table1$frequency <- table(factor(randomdata$cube[1:10], levels = 1:6))
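Since table() returns a named table object, you may want to coerce the result to a plain vector before storing it as a column (a small optional tweak):
table1$frequency <- as.vector(table(factor(randomdata$cube[1:10], levels = 1:6)))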
Or using tidyverse
library(tidyverse)
randomdata %>%
  slice(1:10) %>%
  count(cube = factor(cube, levels = 1:6), .drop = FALSE) %>%
  pull(n) %>%
  mutate(table1, frequency = .)
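For completeness, base R's tabulate() does the same counting in one step. This is just a sketch that assumes the sides are exactly the integers 1 through 6:
table1$frequency <- tabulate(randomdata$cube[1:10], nbins = 6)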
My goal is to go through various signals and ignore any 1's that are not part of a series (a minimum of at least two 1's in a row). The data is an xts time series with 180K+ columns and 84 months; I've provided a small simplified data set below. I've used a nested for loop, but it's taking way too long to finish on the entire data set. It works but is horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
library(xts)

mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))

mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over Signals to change 1's to 0's for one month events
for(col in 1:ncol(mod_sig)) {
  for(row in 1:nrow(mod_sig)) {
    val <- ifelse(mod_sig[row, col] == 1,
                  ifelse(mod_sig[row - 1, col] == 0,
                         ifelse(mod_sig[row + 1, col] == 0, 0, 1), 1), 0)
    Signals_Fin[row, col] <- val
  }
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dplyr; I only made slight modifications based on what was given and some help from a colleague.
Answer code:
library(dplyr)
library(xts)

mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
                      b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
                      c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
                      d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
                      e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
Signals_fin <- mod_sig %>%
  mutate_all(funs(ifelse((. == 1 & (lag(.) == 1 | lead(.) == 1)), 1, 0))) %>%
  mutate_all(funs(ifelse(is.na(.), 0, .)))

Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Here's a stab from a dplyr perspective. I converted your row names to a column, but you can just as easily convert them back to row names with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), function(x){
    if_else(x == 1 &
              (lag(x, order_by = .$months) == 1 |
                 lead(x, order_by = .$months) == 1),
            1,
            0)
  })
As suggested by @Ryan, his mutate_at call is more elegant; it's important that everything is already sorted, though:
mod_sig %>%
  as.data.frame() %>%
  rownames_to_column('months') %>%
  mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
  as.data.frame() %>%
  mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
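If you'd rather stay in base R and keep the xts object as it is, a run-length approach is another option. This is only a sketch (the helper name keep_runs is mine, not from the answers above): it zeroes out any run of 1's shorter than two months and rebuilds the xts object with its original index:
# zero out runs of 1's shorter than min_len, one column at a time
keep_runs <- function(x, min_len = 2) {
  r <- rle(as.numeric(x))
  r$values[r$values == 1 & r$lengths < min_len] <- 0
  inverse.rle(r)
}
Signals_rle <- xts(apply(mod_sig, 2, keep_runs), order.by = index(mod_sig))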
I think this is an interesting task to optimize a piece of R code.
I have a data frame df_red which contains order details from a webshop. For each product (ean), I want to get the 12 other products most likely to appear in a basket with it.
This is sample code to generate such a data set:
library(tidyverse)
# create a vector with 1400 products (characterized by their EANs)
eans <- sample(1e5:1e6, 1400, replace = FALSE)
# create a vector with 200k orders
basket_nr <- 1:2e5
# a basket can have up to 4 items; it's most likely to have 3 items
n_prod_per_basket <- sample(x = 1:4, length(basket_nr), prob = c(0.2, 0.2, 0.5, 0.1), replace = TRUE)
# create df_red, each line of which corresponds to a product with its respective basket number
df <- data_frame(basket_nr, n_prod_per_basket)
df_red <- data_frame(basket_nr = rep(basket_nr, n_prod_per_basket))
df_red$ean <- sample(x = eans, nrow(df_red), replace = TRUE)
The code I am using to accomplish this task is the following. But I am sure it's not an efficient one. How can I increase the speed of the program?
ean <- unique(df_red$ean)
out <- list()
for (i in 1:length(ean)){
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if (i %% 100 == 0) print(paste0(round(i / length(ean) * 100, 2), '% is complete'))
}
Performance improvements are of course a matter of degree. How far to go before it is improved "enough" is hard to say. However, we can reduce run time by about 25% by functionalizing your code and cleaning up the subsetting logic. Starting with your code:
#added a timer
start.time <- Sys.time()
for (i in 1:length(ean)){
  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr
  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]
  # get top most frequent
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:12)
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)
  if (i %% 100 == 0) print(paste0(round(i / length(ean) * 100, 2), '% is complete'))
}
Sys.time() - start.time
This takes between 30 and 34 seconds on my machine. However, we can rewrite it as a function like so:
my.top12.func <- function(id, df_red) {
  # improved subsetting logic - using which is faster and we can remove some code by
  # removing the ean that is being iterated in the filter step below
  prod <- df_red[df_red$basket_nr %in% df_red$basket_nr[which(df_red$ean == id)], ]
  # set cutoff from 12 to 13 since the specific ean will always be one of the top 12
  top12 <- prod %>%
    group_by(ean) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(row_number() %in% 1:13 & ean != id) # additional filter required
  # skip products that weren't together in a basket with at least 12 different other products
  if (nrow(top12) == 12) return(data_frame(ean = id, recom = top12$ean, freq = top12$n))
}
Now we can test the speed and accuracy of this approach by doing:
start.time <- Sys.time()
my.out <- lapply(ean, my.top12.func, df_red = df_red)
Sys.time() - start.time
#test for equality
all.equal(out, my.out)
Which is about 24-26 seconds for a 25%+ improvement.
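If you only want the products that actually ended up with 12 recommendations, the NULL entries that lapply (and the original loop) leave behind for skipped products can be dropped afterwards, for example:
my.out <- my.out[!vapply(my.out, is.null, logical(1))]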
Playing around with data.table, I got the output in under 7 seconds (which I guess is about an 80% improvement):
library(data.table)
setDT(df_red)

all_eans <- df_red[, unique(ean)]

k <- lapply(all_eans, function(x) {
  df_red[basket_nr %in% df_red[ean == x, unique(basket_nr)],
         .N,
         by = ean][order(-N)][2:13]
})

names(k) <- all_eans
# out-of-range row indices come back as NA-padded rows, so check for NA rather than nrow
k <- k[sapply(k, function(d) !anyNA(d$ean))]
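Since the list is named by EAN, the recommendations for a single product can then be looked up by name, e.g. k[[as.character(all_eans[1])]] (assuming that product survived the final filter).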
I would consider not using a loop.
df_red$k <- 1

df_s <- left_join(df_red, df_red, by = "k") %>%
  filter(ean.x != ean.y & basket_nr.x == basket_nr.y) %>%
  group_by(ean.x, ean.y) %>%       # count co-occurrences per product pair
  summarise(n = n()) %>%           # result stays grouped by ean.x
  arrange(desc(n)) %>%
  filter(row_number() %in% 1:12)   # self-pairs are already excluded, so 12 is enough

df_s.ct <- df_s %>% filter(row_number() == 12)
df_s.fin <- df_s[df_s$ean.x %in% df_s.ct$ean.x, ]
The rate-limiting step here is the left_join, which merges the dataset to itself and creates a quadratically larger dataset (so if you have 50,000 rows, you will end up creating a new dataset of 2.5B rows). This suggests that the best way to store and manipulate the data is data.table, which will increase the speed of this procedure, especially when combined with dplyr.
I am trying to replace some filtered values of a data set. So far, I have written these lines of code:
df %>%
  filter(group1 == uniq[i]) %>%
  mutate(values = ifelse(sum(values) < 1, 2, NA))
Here, uniq is just a vector containing the group names I want to focus on (group1 and values are column names). This is actually working. However, it only outputs the altered filtered rows and does not replace anything in the data set df. Does anyone have an idea where my mistake is? Thank you so much! The following code reproduces the example:
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df <- data.frame(group1, values)
uniq <- unique(unlist(df$group1))
for (i in 1:length(uniq)){
  df <- df %>%
    filter(group1 == uniq[i]) %>%
    mutate(values = ifelse(sum(values) < 1, 2, NA))
}
What I would like is for it to leave all values unchanged except the last one, since that row is its own group (group1 == C) and 0.9 < 1. So I'd like to get the exact same data frame, except that 0.9 is replaced with NA. Moreover, would it be possible to just use if instead of ifelse?
dplyr won't create a new object unless you use an assignment operator (<-).
Compare
require(dplyr)
data(mtcars)
mtcars %>% filter(cyl == 4)
with
mtcars4 <- mtcars %>% filter(cyl == 4)
mtcars4
The data are the same, but in the second example the filtered data is stored in a new object, mtcars4.
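Beyond the assignment issue, filter() also drops every row that doesn't match, which is why the loop in the question keeps shrinking df. For the reproducible example above, a grouped mutate keeps all rows and only touches the group whose sum is below 1. This is just a sketch of that idea (it also answers the side question: a plain if works here because the condition is a single value per group):
library(dplyr)

group1 <- c("A", "A", "A", "B", "B", "C")
values <- c(0.6, 0.3, 0.1, 0.2, 0.8, 0.9)
df <- data.frame(group1, values)

# group_by() + mutate() keeps every row; only group C (sum 0.9 < 1) is replaced
df <- df %>%
  group_by(group1) %>%
  mutate(values = if (sum(values) < 1) NA_real_ else values) %>%
  ungroup()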