I have a 25-year data set that looks similar to the following:
date name value tag
1 2014-12-01 f -0.338578654 12
2 2014-12-01 a 0.323379254 4
3 2014-12-01 f 0.004163806 9
4 2014-12-01 f 1.365219477 2
5 2014-12-01 l -1.225602543 7
6 2014-12-01 d -0.308544089 9
This is how to replicate it:
set.seed(9)
date <- rep(seq(as.Date("1990-01-01"), as.Date("2015-01-1"), by="months"), each=50)
N <- length(date)
name <- sample(letters, N, replace=T)
value <- rnorm(N)
tag <- sample(c(1:50), N, replace=T)
mydata <- data.frame(date, name, value, tag)
head(mydata)
I would like to create a new matrix that stores values satisfying multiple criteria: for instance, each cell should hold the sum of the values that have name j and tag i. I use two for-loops and the which() function to filter out the correct values, like this:
S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
for(i in 1:nrow(S)){
for (j in 1:ncol(S)){
foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
S[i,j] <- sum(mydata$value[foo])
}
}
This is ok for small data sets, but too slow for larger ones. Is it possible to avoid the for-loops or somehow speed up the process?
You can use dcast from package reshape2, with a custom function to sum your values:
library(reshape2)
dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
Or simply xtabs, base R:
xtabs(value~name+tag, mydata)
Some benchmarks:
funcPer = function(){
S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
for(i in 1:nrow(S)){
for (j in 1:ncol(S)){
foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
S[i,j] <- sum(mydata$value[foo])
}
}
}
colonel1 = function() dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
colonel2 = function() xtabs(value~name+tag, mydata)
#> system.time(colonel1())
# user system elapsed
# 0.01 0.00 0.01
#> system.time(colonel2())
# user system elapsed
# 0.05 0.00 0.05
#> system.time(funcPer())
# user system elapsed
# 4.67 0.00 4.82
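For completeness, base tapply can build the same tag-by-name matrix directly (a sketch of mine, not included in the benchmark above):
# tapply sums value within each (tag, name) cell; empty cells come back as NA,
# so set them to 0 to match sum(numeric(0)) == 0 from the loop version
S2 <- tapply(mydata$value, list(mydata$tag, mydata$name), sum)
S2[is.na(S2)] <- 0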
I have a dataset that looks like the below:
> head(mydata)
id value1 value2
1: 1 200001 300001
2: 2 200002 300002
3: 3 200003 300003
4: 4 200004 300004
5: 5 200005 300005
6: 6 200006 300006
value1 and value2 represent amounts at the beginning and the end of a given year. I would like to linearly interpolate the value for a given month, for each id (i.e. rowwise).
After trying different options that were slower, I am currently using map2 from the purrr package in combination with approx from base R. I create the new variable using assignment by reference from the data.table package. This is still surprisingly slow, as it takes approximately 2.2 min for my code to run on my data (1.7 million rows).
Note that I also use get() to access the variables for the interpolation, as their names need to be dynamic. This is slowing down my code, but it doesn't seem to be the bottleneck. Also, I have tried to use the furrr package to speed up map2 by making the code parallel, but the speed gains were not material.
Below is a reproducible example with 1,000 rows of data. Any help to speed up the code is greatly appreciated!
library(data.table)
library(purrr)
mydata <- data.table(id = 1:1000, value1= 2001:3000, value2= 3001:4000)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
Just use the formula for linear interpolation to vectorize over the whole data.table.
mydata <- data.table(id = 0:1e6, value1= 2e6:3e6, value2= 3e6:4e6)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
system.time({
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
})
#> user system elapsed
#> 41.50 0.53 42.05
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0 0 0
identical(unlist(mydata$interpolated_value), mydata$interpolated_value2)
#> [1] TRUE
It also works just as fast when m is a vector.
m <- sample(12, 1e6 + 1, 1)
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0.01 0.00 0.02
I am trying to create cross tables using weights::wtd.chi.sq.
The data:
library(readr)
library(rlang)
library(dplyr)
library(weights)
data_in <- read_table2("Q50_1 Q50_2 Q38 Q90 pov gender wgt1 wgt2
never always Yes 2 High M 1.3 0.4
sometimes always No 2 Med F 0.4 0.2
always sometimes Yes 4 Low F 1.2 0.7
never never No 2 High M 0.5 0.7
always always No 4 High M 0.7 0.8
sometimes never Yes 3 Low F 0.56 0.3
sometimes never Yes 2 Med F 0.9 0.1
")
The xtab_func function, which feeds into another function:
xtab_func <- function(dat, col, target, wgt){
col <- rlang::as_string(ensym(col))
target <- rlang::as_string(ensym(target))
wgt <- rlang::as_string(ensym(wgt))
wtd.chi.sq(dat[[target]],dat[[col]], weight = dat[[wgt]])
}
Running it gives:
xtab_func(data_in, 'Q50_1','pov','wgt1')
Chisq df p.value
7.3395092 4.0000000 0.1189981
Now I am looping through a vector of columns to repeat this tabulation for each column. The error happens when I try to pass target and wgt to the xtab_func function above. I've tried 3 different ways, but none of them work.
crosstab <- function(dat, target, columns, wgt,target_name, school_type){
# browser()
target <- rlang::as_string(ensym(target))
print(target)
wgt <- rlang::as_string(ensym(wgt))
target_name <- enquo(target_name)
school_type <- enquo(school_type)
d <- list()
for (i in columns){
# OPTION 1
# x <- xtab_func(dat, i, !!target, !!wgt)
# OPTION 2
x <- xtab_func(dat, i, target, wgt)
# OPTION 3
# x <- xtab_func(dat, i, dat[[target]],dat[[wgt]])
x$i <- i
d[[i]] <- x
df <- do.call(rbind, d)
}
return(df)
}
When I run this, I expect to see the chi-squared values for each of the columns crossed with pov:
cols <- data_in %>% select(starts_with("Q"))
cols <- names(cols)
crosstab(data_in,'pov',cols, 'wgt1', 'pov','trad')
But I get these errors:
Error: Only strings can be converted to symbols
OR
Error in model.frame.default(formula = weight ~ var1 + var2) :
invalid type (NULL) for variable 'var1'
Any idea how I call those variables? Thank you!
It's not clear to me at all why you are trying to use all the rlang stuff when you are just passing character values to your functions. This could be greatly simplified to
xtab_func <- function(dat, col, target, wgt){
weights::wtd.chi.sq(dat[[target]],dat[[col]], weight = dat[[wgt]])
}
and
crosstab <- function(dat, target, columns, wgt,target_name, school_type){
d <- list()
for (i in columns){
x <- as.data.frame(as.list(xtab_func(dat, i, target, wgt)))
x$i <- i
d[[i]] <- x
}
df <- do.call(rbind, d)
return(df)
}
Just use [[]] with character values to index into your data.
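For instance, with the data above (a small illustration of mine):
col <- "pov"
identical(data_in[[col]], data_in$pov)
#> [1] TRUE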
With the OP's original xtab_func, we can modify crosstab to
library(purrr)
library(dplyr)
crosstab <- function(dat, target, columns, wgt,target_name, school_type){
purrr::map_dfr(columns, ~ {
xtab_func(dat, !!.x, !!target, !!wgt)
})
}
Testing:
crosstab(data_in,'pov', cols, 'wgt1', 'pov','trad')
# A tibble: 4 x 3
Chisq df p.value
<dbl> <dbl> <dbl>
1 7.34 4 0.119
2 6.02 4 0.198
3 1.47 2 0.480
4 4.83 4 0.306
I have data that looks like this:
char_column date_column1 date_column2 integer_column
415 18JT9R6EKV 2014-08-28 2014-09-06 1
26 18JT9R6EKV 2014-12-08 2014-12-11 2
374 18JT9R6EKV 2015-03-03 2015-03-09 1
139 1PEGXAVCN5 2014-05-06 2014-05-10 3
969 1PEGXAVCN5 2014-06-11 2014-06-15 2
649 1PEGXAVCN5 2014-08-12 2014-08-16 3
I want to perform a loop that checks every row against the preceding row and, given certain conditions, assigns them the same number (so I can group them later). The point is that if the date segments are close enough, I would collapse them into one segment.
My attempt is the following:
i <- 1
z <- 1
v <- 1
for (i in 2:nrow(df)){
z[i] <- ifelse(df[i,'char_column'] == df[i-1,'char_column'],
ifelse((df[i,'date_column1'] - df[i-1,'date_column2']) <= 5,
ifelse(df[i,'integer_column'] == df[i-1,'integer_column'],
v, v<- v+1),
v <- v+1),
v <- v+1)}
df$grouping <- z
Then I would just group using min(date_column1) and max(date_column2).
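For reference, that collapsing step might look something like this once the grouping column exists (a sketch of mine with data.table, not part of the original post):
library(data.table)
setDT(df)
# one collapsed segment per group: earliest start date and latest end date
df[, .(char_column = first(char_column),
       date_column1 = min(date_column1),
       date_column2 = max(date_column2)),
   by = grouping]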
The loop-based approach works perfectly for, say, 100,000 rows (22.86 seconds), but for a million rows it takes 33.18 minutes! I have over 60 million rows to process; is there a way I can make the process more efficient?
PS: To generate a similar table, you can use the following code:
x <- NULL
for (i in 1:200) { x[i] <- paste(sample(c(LETTERS, 1:9), 10), collapse = '')}
y <- sample((as.Date('2014-01-01')):as.Date('2015-05-01'), 1000, replace = T)
y2 <- y + sample(1:10)
df <- data.frame(char_column = sample(x, 1000, rep = T),
date_column1 = as.Date(y, origin = '1970-01-01'),
date_column2 = as.Date(y2,origin = '1970-01-01'),
integer_column = sample(1:3,1000, replace = T),
row.names = NULL)
df <- df[order(df$char_column, df$date_column1),]
Since data.table::rleid does not work, I post another (hopefully) fast solution
1. Get rid of nested ifelse
ifelse is often slow, especially for scalar evaluation; use if instead.
Nested ifelse should be avoided whenever possible: observe that ifelse(A, ifelse(B, x, y), y) can be suitably replaced by if (A&B) x else y
f1 <- function(df){
z <- rep(NA, nrow(df))
z[1] <- 1
char_col <- df[, 'char_column']
date_col1 <- df[, 'date_column1']
date_col2 <- df[, 'date_column2']
int_col <- df[, 'integer_column']
for (i in 2:nrow(df)){
if((char_col[i] == char_col[i-1])&((date_col1[i] - date_col2[i-1]) <= 5)&(int_col[i] == int_col[i-1]))
{
z[i] <- z[i-1]
}
else
{
z[i] <- z[i-1]+1
}
}
z
}
f1 is about 40% faster than the original solution for 10,000 rows.
system.time(f1(df))
user system elapsed
2.72 0.00 2.79
2. Vectorize
Upon closer inspection the conditions inside if can be vectorized
library(data.table)
f2 <- function(df){
z <- rep(NA, nrow(df))
z[1] <- 1
char_col <- df[, 'char_column']
date_col1 <- df[, 'date_column1']
date_col2 <- df[, 'date_column2']
int_col <- df[, 'integer_column']
cond <- (char_col==shift(char_col))&(date_col1 - shift(date_col2) <= 5)&(int_col==shift(int_col))
for (i in 2:nrow(df)){
if(cond[i])
{
z[i] <- z[i-1]
}
else
{
z[i] <- z[i-1]+1
}
}
z
}
# for 10000 rows
system.time(f2(df))
# user system elapsed
# 0.01 0.00 0.02
3. Vectorize, Vectorize
While f2 is already quite fast, a further vectorization is possible. Observe how z is calculated: cond is a logical vector, and z[i] = z[i-1] + 1 when cond is FALSE. This is none other than cumsum(!cond).
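A tiny illustration of that equivalence (my own example; cond[1] is NA because of shift):
cond <- c(NA, TRUE, FALSE, TRUE, TRUE, FALSE)
cumsum(!c(FALSE, cond[-1L]))   # same z as the if/else loop would produce
#> [1] 1 1 2 2 2 3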
f3 <- function(df){
setDT(df)
df[, cond := (char_column==shift(char_column))&(date_column1 - shift(date_column2) <= 5)&(integer_column==shift(integer_column)),]
df[, group := cumsum(!c(FALSE, cond[-1L])),]
}
For 1M rows
system.time(f3(df))
# user system elapsed
# 0.05 0.05 0.09
system.time(f2(df))
# user system elapsed
# 1.83 0.05 1.87
I have the following data.frame (which is longer than the following example):
sub height group
1 1.55 a
2 1.65 a
3 1.76 b
4 1.77 a
5 1.58 c
6 1.65 d
7 1.82 c
8 1.91 c
9 1.77 b
10 1.69 b
11 1.74 a
12 1.75 c
I'm making a data partition with the following code:
library("caret")
train = createDataPartition(df$group, p = 0.50)[[1]]  # [[1]] because createDataPartition returns a list by default
partition = df[train, ]
So it takes roughly half of the subjects from each group.
My problem in this example is that sometimes a subject from group d will be picked and sometimes not (because group d is really small). I want to add a constraint so that in every partition I make, at least 1 subject from EVERY group is picked.
Any graceful solution?
I came up with a not-so-graceful solution that looks like this:
allGroupSamples <- c()
for (i in unique(df$group))
{
  allGroupSamples <- c(allGroupSamples, sample(rownames(df[df$group == i, ]), 1, replace = TRUE))
}
allGroupSamples <- as.integer(allGroupSamples)
train <- createDataPartition(df$group, p = 0.50)[[1]]
train <- c(allGroupSamples, train)
partition <- df[unique(train), ]
You can use split on a data.frame and sample within each group taking half of the records or 1, whichever is greater:
# apply a function over the split data.frame
samples <- lapply(split(df, df$group), function(x) {
# the function takes a random sample of half the records in each group
# by using `ceiling`, it guarantees at least one record
s <- sample(nrow(x), ceiling(nrow(x)/2))
x[s,]
})
train <- do.call(rbind, samples)
Edit:
If you need a numeric vector:
s <- tapply(1:nrow(df), df$group, function(x) {
sample(x, ceiling(length(x)/2))
})
do.call(c, s)
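The resulting indices can then be used the same way as train in the question (a usage sketch):
train <- do.call(c, s)   # row indices, with at least one per group
partition <- df[train, ]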
Long time lurker, first time asker.
I'm trying to calculate 'items in common between 2 sets of items' for a 20M+ items dataset. Sample data looks like this.
#serially numbered items
parents <- rep(1:10000)
#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))
#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)){
if (numchild[x]>0){
f1 <- sample(1:length(parents), size=numchild[x])
f2 <- list(parents[f1])
parent_child <- c(parent_child, f2)
}
else {
parent_child <- c(parent_child, list(x+1)) #if numchild=0, make up something
}
}
Here is what I want to do: say parent item #1 has 5 child items (1, 2, 3, 4, 5) and parent item #2 has 3 child items (4, 10, 22).
I want to compute the length of the intersection for every (parent_i, parent_j) combination. In the above case, it would be 1 common item: 4.
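For that small example, the computation is just (using base R's intersect):
length(intersect(c(1, 2, 3, 4, 5), c(4, 10, 22)))
#> [1] 1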
I am doing this for 10M+ parent items that on average have 15-20 children items with a (0,100) range. So that's a 10M x 10M item-item matrix.
I have a foreach loop that I am testing out on a smaller subset; it works but doesn't quite scale to the full dataset (64-core machine with 256GB RAM). With the loop below I am already computing only half of the user-user matrix, since (parent_i, parent_j) is the same as (parent_j, parent_i) for this purpose.
#small subset
a <- parent_child[1:1000]
outerresults <- foreach (i = 1:(length(a)), .combine=rbind, .packages=c('foreach','doParallel')) %dopar% {
b <- a[[i]]
rest <- a[i+1:length(a)]
foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
common <- length(intersect(b, rest[[j]]))
if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
}
}
I've been experimenting with variations on this (using Reduce, storing the parent-children relations in a data.frame, etc.) but haven't had much luck.
Is there a way to make this scale?
I reversed the split, so that we have a child-parent relationship
len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE))
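To see what the reversal produces, here is a tiny hypothetical example of my own:
pc <- list(c(1, 2), c(2, 3), 2)   # three parents and their children
len <- sapply(pc, length)
split(rep(seq_along(pc), len), unlist(pc, use.names=FALSE))
#> $`1`
#> [1] 1
#> $`2`
#> [1] 1 2 3
#> $`3`
#> [1] 2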
Something like the following constructs a string with pairs of parents sharing a child, across all children
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
x <- combn(sort(x), 2)
paste(x[1,], x[2,], sep=".")
})
and tallying
table(unlist(int, use.names=FALSE))
or a little more quickly
xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
Putting this together as a function:
f1 <- function(parent_child) {
len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE))
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
x <- combn(sort(x), 2)
paste(x[1,], x[2,], sep=".")
})
xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
}
With this (for all 10,000 parent-child elements):
> system.time(ans1 <- f1(parent_child))
user system elapsed
14.625 0.012 14.668
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806
1 1 1 1 1 1
I'm not sure that this would really scale to the size of problem you're talking about, though -- it's polynomial in the number of parents per child.
One possibility for speed-up is to 'memoize' the combinatorial calculation, using the length of the argument as a 'key' and storing the combination as 'value'. This reduces the number of times combn is called to the number of unique lengths of elements of child_parent.
combn1 <- local({
memo <- new.env(parent=emptyenv())
function(x) {
key <- as.character(length(x))
if (!exists(key, memo))
memo[[key]] <- t(combn(length(x), 2))
paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
}
})
f2 <- function(parent_child) {
len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE))
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], combn1)
xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
}
which helps somewhat
> system.time(ans2 <- f2(parent_child))
user system elapsed
5.337 0.000 5.347
> identical(ans1, ans2)
[1] TRUE
The slow part is now paste
> Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
self.time self.pct total.time total.pct
"paste" 3.92 73.41 3.92 73.41
"match" 0.74 13.86 0.74 13.86
"unique.default" 0.40 7.49 0.40 7.49
"as.character" 0.08 1.50 0.08 1.50
"unlist" 0.08 1.50 0.08 1.50
"combn" 0.06 1.12 0.06 1.12
"lapply" 0.02 0.37 4.00 74.91
"any" 0.02 0.37 0.02 0.37
"setNames" 0.02 0.37 0.02 0.37
$by.total
...
We can avoid this by encoding the parents with a shared child id into a single integer; because of the way floating point numbers are represented in R, this will be exact until about 2^21.
encode <- function(x, y, n)
(x - 1) * (n + 1) + y
decode <- function(z, n)
list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
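A quick sanity check of the round trip (a sketch, with a small hypothetical n):
n <- 10
encode(3, 7, n)
#> [1] 29
decode(29, n)
#> $x
#> [1] 3
#>
#> $y
#> [1] 7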
and adjusting our combn1 and f2 functions as
combn2 <- local({
memo <- new.env(parent=emptyenv())
function(x, encode_n) {
key <- as.character(length(x))
if (!exists(key, memo))
memo[[key]] <- t(combn(length(x), 2))
encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
}
})
f3 <- function(parent_child) {
encode_n <- length(parent_child)
len <- sapply(parent_child, length)
child_parent <-
unname(split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE)))
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], combn2, encode_n)
id <- unlist(int, use.names=FALSE)
    uid <- unique(id)
    n <- tabulate(match(id, uid), length(uid))
do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
}
leading to
> system.time(f3(parent_child))
user system elapsed
2.140 0.000 2.146
This compares very favorably (note that the timing in the previous line is for 10,000 parent-child relations) with jlhoward's revised answer
> system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
user system elapsed
2.465 0.000 2.468
> system.time(f3(parent_child[1:99]))
user system elapsed
0.016 0.000 0.014
and scales in a much more reasonable way.
For what it's worth, the data generation routine falls into the second circle of Patrick Burns' R Inferno, using the 'copy-and-append' pattern rather than pre-allocating the space and filling it in. Avoid this by writing the for loop body as a function and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the issue beforehand
numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
or by sampling from a distribution (rpois, rbinom) that generates positive integer values. Data generation is then
n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)
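Or, following the rpois suggestion above (a sketch; a Poisson draw is never negative, so no fix-up is needed):
numchild <- rpois(n_parents, lambda = 30)
parent_child <- lapply(numchild, sample, x = n_parents)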
Here is another approach that is about 10X faster than my previous answer, and 17X faster than the original code (also simpler):
ff <- function(u2, u1, a) {
common <- length(intersect(a,parent_child[[u2]]))
if (common>0) {return(data.frame(u1,u2,common))}
}
gg <- function(u1) {
a <- parent_child[[u1]]
do.call("rbind",lapply((u1+1):100,ff,u1,a))
}
system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
user system elapsed
1.04 0.00 1.03
result.3 is identical to result.2 from the previous answer:
max(abs(result.3-result.2))
[1] 0
Well, a small improvement (I think):
Original code (wrapped in function call):
f = function(n) {
#small subset
a <- parent_child[1:n]
outerresults <- foreach (i = 1:(length(a)),
.combine=rbind,
.packages=c('foreach','doParallel')) %dopar% {
b <- a[[i]]
rest <- a[i+1:length(a)]
foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
common <- length(intersect(b, rest[[j]]))
if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
}
}
return(outerresults)
}
Modified code:
g <- function(n) {
a <- parent_child[1:n]
outerresults <- foreach (i = 1:n,
.combine=rbind,
.packages=c('foreach','doParallel')) %dopar% {
b <- a[[i]]
foreach (j = (i):n, .combine=rbind) %dopar% {
if (i!=j) {
c <- a[[j]]
common <- length(intersect(b, c))
if (common > 0) {g <- data.frame(u1=i, u2=j, common)}
}
}
}
return(outerresults)
}
Benchmarks:
system.time(result.old<-f(100))
user system elapsed
17.21 0.00 17.33
system.time(result.new<-g(100))
user system elapsed
10.42 0.00 10.47
The numbering for u2 is a little different because of the different approaches, but both produce the same vector of matches:
max(abs(result.old$common-result.new$common))
[1] 0
I tried this with data.table joins replacing intersect(...) and it was actually much slower (!!).
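For reference, one way such a data.table join might be written (my own sketch of the idea; the original post does not show its join code): build a long (parent, child) table, self-join on child, and count the pairs.
library(data.table)
dt <- data.table(parent = rep(seq_along(parent_child), lengths(parent_child)),
                 child  = unlist(parent_child, use.names = FALSE))
# self-join on child, keep each unordered pair once, count shared children
pairs <- dt[dt, on = "child", allow.cartesian = TRUE][
  parent < i.parent, .(common = .N), by = .(u1 = parent, u2 = i.parent)]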