Create combinations within a split group in r - r

With the data frame below of Locations, Days, and Quantities, I'm searching for a solution to create combinations of quantities by Location across each Day. In production, these combinations may grow pretty large, so a data.table or plyr approach would be appreciated.
library(gtools)
dat <- data.frame(Loc = c(51,51,51,51,51), Day = c("Mon","Mon","Tue","Tue","Wed"),
Qty = c(1,2,3,4,5))
The output for this example should be:
Loc Day Qty
1 51 Mon 1
2 51 Tue 3
3 51 Wed 5
4 51 Mon 1
5 51 Tue 4
6 51 Wed 5
7 51 Mon 2
8 51 Tue 3
9 51 Wed 5
10 51 Mon 2
11 51 Tue 4
12 51 Wed 5
I've tried a few nested lapply's which gets me close, but then I'm not sure how to take it to the next step and use the combn() function within each store.
lapply(split(dat, dat$Loc), function(x) {
lapply(split(x, x$Day), function(y) {
y$Qty
})
})
I'm able to get the correct combinations if each Store > Day group was in it's own list, but am struggling how to get there from a data frame using a split-apply-combine method.
loc51_mon <- c(1,2)
loc51_tue <- c(3,4)
loc51_wed <- c(5)
unlist(lapply(loc51_mon, function(x) {
lapply(loc51_tue, function(y) {
lapply(loc51_wed, function(z) {
combn(c(x,y,z), 3)
})
})
}), recursive = FALSE)
[[1]]
[[1]][[1]]
[,1]
[1,] 1
[2,] 3
[3,] 5
[[2]]
[[2]][[1]]
[,1]
[1,] 1
[2,] 4
[3,] 5
[[3]]
[[3]][[1]]
[,1]
[1,] 2
[2,] 3
[3,] 5
[[4]]
[[4]][[1]]
[,1]
[1,] 2
[2,] 4
[3,] 5

This should work however further complexity would require changes to the function:
library(data.table)
dat <- data.frame(Loc = c(51,51,51,51,51), Day = c("Mon","Mon","Tue","Tue","Wed"),
Qty = c(1,2,3,4,5), stringsAsFactors = F)
setDT(dat)
comb_in <- function(Qty_In,Day_In){
temp_df <- aggregate(Qty_In ~ Day_In, cbind(Qty_In, as.character(Day_In)), paste, collapse = "|")
temp_list <- strsplit(temp_df$Qty_In, split = "|", fixed = T)
names(temp_list) <- as.character(temp_df$Day)
melt(as.data.table(expand.grid(temp_list))[, case_group := .I], id.vars = "case_group", variable.name = "Day", value.name = "Qty")
}
dat[, comb_in(Qty_In = Qty, Day_In = Day), by = Loc][order(Loc,case_group,Day)]
Loc case_group Day Qty
1: 51 1 Mon 1
2: 51 1 Tue 3
3: 51 1 Wed 5
4: 51 2 Mon 2
5: 51 2 Tue 3
6: 51 2 Wed 5
7: 51 3 Mon 1
8: 51 3 Tue 4
9: 51 3 Wed 5
10: 51 4 Mon 2
11: 51 4 Tue 4
12: 51 4 Wed 5
You can now filter by case_group to get each combination

this question is quite similar to How to expand.grid on vectors sets rather than single elements
for a general approach (performance likely to be slower than a problem specified approach):
permu.sets <- function(listoflist) {
#assumes that each list within listoflist contains vectors of equal lengths
temp <- expand.grid(listoflist)
do.call(cbind, lapply(temp, function(x) do.call(rbind, x)))
} #permu.sets
#for the problem posted in OP
dat <- data.frame(Loc = c(51,51,51,51,51), Day = c("Mon","Mon","Tue","Tue","Wed"),
Qty = c(1,2,3,4,5))
vecsets <- lapply(split(dat, dat$Day), function(x) split(as.matrix(x), row(x)))
res <- permu.sets(vecsets)
lapply(split(res, seq(nrow(res))), function(x) matrix(x, ncol=3, byrow=T ))

Related

How to use functions to do a recursive calculation in data.table/R?

I am new to Programming and got stuck in it. I wanted to calculate the hourly temperature variation of an object throughout the year using some variables, which changes in every hour. The original data contains 60 columns and 8760 rows for the calculation.
I got the desired output using the for loop, but the model is taking a lot of time for the calculation. I wonder if there is any way to replace the loop with functions, which I suspect, can also increase the speed of the calculations.
Here is a small reproducible example to show what I did.
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table
A B C
1: 1 1 10
2: 1 2 10
3: 1 3 10
4: 1 4 10
5: 1 5 10
The forloop
for (j in (2: nrow(table))) {
table$A[j] = (table$A[j-1] + table$B[j-1]) * table$B[j]
table$C[j] = table$B[j] * table$A[j]
}
I got the output as I desired:
A B C
1: 1 1 10
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
but it took 15 min to run the whole program in my case (not this!)
So I tried to use function instead of the for loop.
I tried this:
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
myfun <- function(df){
df = df %>% mutate(A = (lag(A) + lag(B)) * B,
C = B * A)
return(df)
}
myfun(table)
But the output was
A B C
1 NA 1 NA
2 4 2 8
3 9 3 27
4 16 4 64
5 25 5 125
As it seems that the function refers to the rows of the first table not the updated rows after the calculation. Is there any way to obtain the desired output using functions? It is my first R project, any help is very much appreciated. Thank you.
A much faster alternative using data.table. Note that the calculation of C can be separated from the calculation of A so we can do less within the loop:
for (i in 2:nrow(table)) {
set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
table[-1, C := A * B]
table
# A B C
# <num> <int> <num>
# 1: 1 1 10
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
You can try Reduce like below
dt[
,
A := Reduce(function(x, Y) (x + Y[2]) * Y[1],
asplit(embed(B, 2), 1),
init = A[1],
accumulate = TRUE
)
][
,
C := A * B
]
which updates dt as
> dt
A B C
1: 1 1 1
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
data
dt <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
Here's a solution using purrr::accumulate2 which lets you use the result of the previous computation as the input to the next one:
library(data.table)
library(purrr)
library(magrittr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table$A <- accumulate2(
table$A,
seq(table$A),
~ (..1 + table$B[..3]) * table$B[..3 + 1],
.init = table$A[1]
) %>%
unlist() %>%
extract(1:nrow(table))
table$C <- table$B * table$A
table
# A B C
# 1: 1 1 1
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200

Identify sequences between identical values

I have a large matrix:
id v1 v2 v3 v4 v5 v6 v7 v8
1001 37 15 30 37 4 11 35 37
2111 44 31 44 30 24 39 44 18
3121 43 49 39 34 44 43 26 24
4532 45 31 26 33 12 47 37 15
5234 23 27 34 23 30 34 23 4
6345 9 46 39 34 8 43 26 24
For each row (id), I would like to identify intervals of numbers in column v1 to v8. An interval is here defined as a sequence of numbers which starts and ends with the same number.
For example, in the first row, there are two sequences which both start and ends with 37: From column 1 to 4 (37, 15, 30, 37) and from column 4 to column 8 (37, 4, 11, 35, 37).
The focal value should only occur in start and end positions. For example, in the first row, the sequence from 37 at V1, to 37 at V8 is not included, because 37 also occurs in V4.
For each interval, I want the index of the start and end columns, the focal start and end value, and the sequence of numbers in between.
Desired output:
1001 [v1] to [v4] 37 to 37: 15,30
1001 [v4] to [v8] 37 to 37: 4, 11, 35
2111 [v1] to [v3] 44 to 44: 31
2111 [v3] to [v7] 44 to 44: 30, 24, 39
Any suggestions? Algorithm?
I managed to code for the indices for a vector not a matrix,
a <- which(x == 37)
from <- a[!(a-1) %in% a]
to <- a[!(a+1) %in% a]
rbind(from, to)
Very brute-force method. Get unique elements for the given row, check if they are present more than once but not side-by-side, then lapply through each, getting the elements of the row x between them.
apply(m, 1, function(x) {
u <- unique(x)
u <- u[sapply(u, function(u) any(diff(which(x == u)) > 1))]
lapply(setNames(u, u), function(u){
ind <- which(x == u)
lapply(seq(length(ind) - 1),
function(i) x[seq(ind[i] + 1, ind[i + 1] - 1)])
})
})
Output:
# [[1]]
# [[1]]$`37`
# [[1]]$`37`[[1]]
# [1] 15 30
#
# [[1]]$`37`[[2]]
# [1] 4 11 35
#
#
#
# [[2]]
# [[2]]$`44`
# [[2]]$`44`[[1]]
# [1] 31
#
# [[2]]$`44`[[2]]
# [1] 30 24 39
#
#
#
# [[3]]
# [[3]]$`43`
# [[3]]$`43`[[1]]
# [1] 49 39 34 44
#
#
#
# [[4]]
# named list()
#
# [[5]]
# [[5]]$`23`
# [[5]]$`23`[[1]]
# [1] 27 34
#
# [[5]]$`23`[[2]]
# [1] 30 34
#
#
# [[5]]$`34`
# [[5]]$`34`[[1]]
# [1] 23 30
#
#
#
# [[6]]
# named list()
Edit: Henrik's answer inspired me to do a join-based version
library(data.table)
library(magrittr)
d <- melt(as.data.table(m), "id", variable.name = 'ci')[, ci := rowid(id)]
setorder(d, id)
options(datatable.nomatch = 0)
d[d, on = .(id, value, ci > ci)
, .(id, value, i.ci, x.ci)
, mult = 'first'] %>%
.[d, on = .(id, i.ci < ci, x.ci > ci)
, .(id, value, from_ci = x.i.ci, to_ci = x.x.ci, i.value)] %>%
.[, .(val = .(i.value))
, by = setdiff(names(.), 'i.value')]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 34 3 6 23,30
# 8: 5234 23 4 7 30,34
Here's a data.table alternative.
Convert matrix to data.table and melt to long format. Create a column index 'ci' to keep track of the original columns (rowid(id)). Order by 'id'.
For each 'id' and 'value' (by = .(id, value)), check if number of rows is larger than one (if(.N > 1)), i.e. if there is at least one sequence. If so, grab the row index (.I) of the sequences and their column indexes (in the original data). For each sequence, grab the corresponding values between start and end index. Wrap in list twice (.(.() to create a list column.
library(data.table)
d <- melt(as.data.table(m), id.vars = "id")
d[ , `:=`(
ci = rowid(id),
variable = NULL)]
setorder(d, id)
d2 <- d[ , if(.N > 1){
.(from = .I[-.N], to = .I[-1],
from_ci = ci[-.N], to_ci = ci[ -1])
}, by = .(id, value)]
d2[ , val := .(.(d$value[seq(from + 1, to - 1)])), by = 1:nrow(d2)]
d2[ , `:=`(from = NULL, to = NULL)]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 23 4 7 30,34
# 8: 5234 34 3 6 23,30

Adding Proportionate Na's in a column [duplicate]

I have a complete dataframe. I want to 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.
A <- c(1:10)
B <- c(11:20)
C <- c(21:30)
df<- data.frame(A,B,C)
Can anyone suggest a quick way of doing that?
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26
as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30
It's a random process, so it might not give 15% every time.
You can unlist the data.frame and then take a random sample, then put back in a data.frame.
df <- unlist(df)
n <- length(df) * 0.15
df[sample(df, n)] <- NA
as.data.frame(matrix(df, ncol=3))
It can be done a bunch of different ways using sample().
If you are in the mood to use purrr instead of lapply, you can also do it like this:
> library(purrr)
> df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
> df
A B C
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
5 5 15 25
6 6 16 26
7 7 17 27
8 8 18 28
9 9 19 29
10 10 20 30
> map_df(df, function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 1 11 21
2 2 12 22
3 NA 13 NA
4 4 14 NA
5 5 15 25
6 6 16 26
7 7 17 27
8 8 NA 28
9 9 19 29
10 10 20 30
Same result, using binomial distribution:
dd=dim(df)
nna=20/100 #overall
df1<-df
df1[matrix(rbinom(prod(dd), size=1,prob=nna)==1,nrow=dd[1])]<-NA
df1
May i suggest a first function (ggNAadd) designed to do this, and improve it with a second function providing graphical distribution of the NAs created (ggNA)
What is neat is the possibility to input either a proportion of a fixed number of NAs.
ggNAadd = function(data, amount, plot=F){
temp <- data
amount2 <- ifelse(amount<1, round(prod(dim(data))*amount), amount)
if (amount2 >= prod(dim(data))) stop("exceeded data size")
for (i in 1:amount2) temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- NA
if (plot) print(ggNA(temp))
return(temp)
}
And the plotting function:
ggNA = function(data, alpha=0.5){
require(ggplot2)
DF <- data
if (!is.matrix(data)) DF <- as.matrix(DF)
to.plot <- cbind.data.frame('y'=rep(1:nrow(DF), each=ncol(DF)),
'x'=as.logical(t(is.na(DF)))*rep(1:ncol(DF), nrow(DF)))
size <- 20 / log( prod(dim(DF)) ) # size of point depend on size of table
g <- ggplot(data=to.plot) + aes(x,y) +
geom_point(size=size, color="red", alpha=alpha) +
scale_y_reverse() + xlim(1,ncol(DF)) +
ggtitle("location of NAs in the data frame") +
xlab("columns") + ylab("lines")
pc <- round(sum(is.na(DF))/prod(dim(DF))*100, 2) # % NA
print(paste("percentage of NA data: ", pc))
return(g)
}
Which gives (using ggplot2 as graphical output):
ggNAadd(df, amount=0.20, plot=TRUE)
## [1] "percentage of NA data: 20"
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 NA 24
## ..
Of course, as mentioned earlier, if you ask too many NAs the actual percentage will drop because of repetitions.
A mutate_all approach:
df %>%
dplyr::mutate_all(~ifelse(sample(c(TRUE, FALSE), size = length(.), replace = TRUE, prob = c(0.8, 0.2)),
as.character(.), NA))

Create an update function

I would like to create an update function using lazy evaluation and the mutate_if function from dplyrExtras by skranz.
It would work something like this:
data %>%
update(variable1_original = variable1_update,
variable2_original = variable2_update)
would be evaluated as
data %>%
mutate_if(!is.na(variable1_update),
variable1_original = variable1_update) %>%
mutate_if(!is.na(variable2_update),
variable2_original = variable2_update) %>%
select(-variable1_update, variable2_update)
Yikes, that package isn't very fun to use. mutate_if doesn't seem to work with data.frames and the package doesn't have standard-evaluation alternatives for functions like standard dplyr does. Here's an attempt to re-create the function
myupdate <- function(.data, ...) {
dots <- as.list(substitute(...()))
dx <- Reduce(function(a,b) {
upd <- b[[1]]
ifc <- bquote(!is.na(.(upd)))
do.call("mutate_if", c(list(a, ifc), b))
}, split(dots, seq_along(dots)), .data)
select_(dx, .dots=sapply(dots, function(x) bquote(-.(x))))
}
To test it, i used
library(data.table)
dd<-data.table(
a = c(1:3, NA, 5:8)+0,
b = c(1:2, NA, 4:5, NA, 7:8)+100,
x= 1:8+20,
y=1:8+30
)
dd
# a b x y
# 1: 1 101 21 31
# 2: 2 102 22 32
# 3: 3 NA 23 33
# 4: NA 104 24 34
# 5: 5 105 25 35
# 6: 6 NA 26 36
# 7: 7 107 27 37
# 8: 8 108 28 38
and then I ran
myupdate(dd, x=b, y=a)
# x y
# 1: 101 1
# 2: 102 2
# 3: 23 3
# 4: 104 34
# 5: 105 5
# 6: 26 6
# 7: 107 7
# 8: 108 8
Notice how columns "a" and "b" disappear. Also see how values in rows 3 and 6 in column "x" and the value in row 4 in column "y" was preserved because the corresponding values in columns "b" and "a" were NA.

randomly insert sequence of missing data (NAs) [duplicate]

I have a complete dataframe. I want to 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.
A <- c(1:10)
B <- c(11:20)
C <- c(21:30)
df<- data.frame(A,B,C)
Can anyone suggest a quick way of doing that?
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26
as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30
It's a random process, so it might not give 15% every time.
You can unlist the data.frame and then take a random sample, then put back in a data.frame.
df <- unlist(df)
n <- length(df) * 0.15
df[sample(df, n)] <- NA
as.data.frame(matrix(df, ncol=3))
It can be done a bunch of different ways using sample().
If you are in the mood to use purrr instead of lapply, you can also do it like this:
> library(purrr)
> df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
> df
A B C
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
5 5 15 25
6 6 16 26
7 7 17 27
8 8 18 28
9 9 19 29
10 10 20 30
> map_df(df, function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 1 11 21
2 2 12 22
3 NA 13 NA
4 4 14 NA
5 5 15 25
6 6 16 26
7 7 17 27
8 8 NA 28
9 9 19 29
10 10 20 30
Same result, using binomial distribution:
dd=dim(df)
nna=20/100 #overall
df1<-df
df1[matrix(rbinom(prod(dd), size=1,prob=nna)==1,nrow=dd[1])]<-NA
df1
May i suggest a first function (ggNAadd) designed to do this, and improve it with a second function providing graphical distribution of the NAs created (ggNA)
What is neat is the possibility to input either a proportion of a fixed number of NAs.
ggNAadd = function(data, amount, plot=F){
temp <- data
amount2 <- ifelse(amount<1, round(prod(dim(data))*amount), amount)
if (amount2 >= prod(dim(data))) stop("exceeded data size")
for (i in 1:amount2) temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- NA
if (plot) print(ggNA(temp))
return(temp)
}
And the plotting function:
ggNA = function(data, alpha=0.5){
require(ggplot2)
DF <- data
if (!is.matrix(data)) DF <- as.matrix(DF)
to.plot <- cbind.data.frame('y'=rep(1:nrow(DF), each=ncol(DF)),
'x'=as.logical(t(is.na(DF)))*rep(1:ncol(DF), nrow(DF)))
size <- 20 / log( prod(dim(DF)) ) # size of point depend on size of table
g <- ggplot(data=to.plot) + aes(x,y) +
geom_point(size=size, color="red", alpha=alpha) +
scale_y_reverse() + xlim(1,ncol(DF)) +
ggtitle("location of NAs in the data frame") +
xlab("columns") + ylab("lines")
pc <- round(sum(is.na(DF))/prod(dim(DF))*100, 2) # % NA
print(paste("percentage of NA data: ", pc))
return(g)
}
Which gives (using ggplot2 as graphical output):
ggNAadd(df, amount=0.20, plot=TRUE)
## [1] "percentage of NA data: 20"
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 NA 24
## ..
Of course, as mentioned earlier, if you ask too many NAs the actual percentage will drop because of repetitions.
A mutate_all approach:
df %>%
dplyr::mutate_all(~ifelse(sample(c(TRUE, FALSE), size = length(.), replace = TRUE, prob = c(0.8, 0.2)),
as.character(.), NA))

Resources