Join 2 data frames using data.table with conditions - r

I have these two data frames:
library(data.table)
set.seed(42)
A <- data.table(station = sample(1:10, 1000, replace = TRUE),
                hash    = sample(letters[1:5], 1000, replace = TRUE),
                point   = sample(1:24, 1000, replace = TRUE))
B <- data.table(station = sample(1:10, 100, replace = TRUE),
                card    = sample(letters[6:10], 100, replace = TRUE),
                point   = sample(1:24, 100, replace = TRUE))
The real data frame A contains more than 1M rows.
I'm trying to find the hash (from A) for each card (from B), subject to some conditions: the station and point values in A must lie within a range of the values in B (±1 for station, and up to +2 for point).
My approach is to group B by card and, for each group, run a function that binds together the rows of A satisfying those conditions and then takes the hash with the highest frequency.
detect <- function(x) {
  am0 <- data.frame(station = 0,
                    hash = 0,
                    point = 0)
  for (i in 1:nrow(x)) {
    # note: this indexes B rather than the group x, which is the source of the wrong result
    am1 <- A %>%
      filter(station %in% (B$station[i] - 1):(B$station[i] + 1) &
               point > B$point[i] & point < B$point[i] + 2)
    am0 <- rbind(am0, am1)
  }
  t <- as.data.frame(table(am0$hash))
  t <- t %>%
    arrange(-Freq) %>%
    filter(row_number() == 1)
  return(t)
}
And then just:
library(dplyr)
B %>%
  group_by(card) %>%
  do(detect(.)) %>%
  ungroup()
But I don't know how to make the function operate on the rows of each group (it still indexes B with [i]), so I get the wrong result:
# A tibble: 5 x 3
card Var1 Freq
<chr> <fctr> <int>
1 f c 46
2 g c 75
3 h c 41
4 i c 64
5 j c 62
I'm a beginner, but I understand that the best solution for big datasets like these is a data.table join. Can you help me find a solution?

I think what you want to do is:
#### Prepare join limits
B[, point_limit := as.integer(point + 2)]
B[, station_lower := as.integer(station - 1)]
B[, station_upper := as.integer(station + 1)]
## Join A onto B; creates all combinations of rows in A and B fulfilling the conditions
joined_table <- B[A,
                  on = .(point_limit >= point, point <= point,
                         station_lower <= station, station_upper >= station),
                  nomatch = 0,
                  allow.cartesian = TRUE]
## Count the occurrences of the combinations
counted_table <- joined_table[, .N, by = .(card, hash)][order(card, -N)]
## Select the top row for each group.
counted_table[, head(.SD, 1), by = .(card)][order(card)]
This creates a full table with all the qualifying combinations and then does the counting on that. It relies purely on data.table in order to take full advantage of the speed gains from that package. The data.table vignettes are a good read if you are unfamiliar with the syntax. The nomatch = 0 argument ensures that we are doing an inner join.
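As a side note, an equivalent data.table idiom for keeping the most frequent hash per card, which does not rely on the table having been sorted first, is:
## pick the row with the largest N within each card group
counted_table[counted_table[, .I[which.max(N)], by = card]$V1]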
This will probably be fine if A is only about 1M rows and B stays roughly its current size, depending on how your data are distributed. We can, however, also split B in a way similar to your do statement, using the purrr package. I'm not sure how this interacts with R's garbage collection, though.
frame_list <- purrr::map(
  unique(B$card),
  ~ B[card == .x][A,
                  on = .(point_limit >= point,
                         point <= point,
                         station_lower <= station,
                         station_upper >= station),
                  nomatch = 0,
                  allow.cartesian = TRUE][, .N, by = .(card, hash)])
counted_table_mem <- rbindlist(frame_list)
Something to note here is that I use rbindlist instead of repeated calls to rbind. Calling rbind repeatedly is very slow, because new memory has to be allocated on every call.
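To make the difference concrete, here is a rough sketch of the two patterns; the first copies the accumulated table on every iteration, while the second allocates the result once:
## growing a data.table with rbind inside a loop: repeated copying of everything so far
out <- data.table()
for (piece in frame_list) out <- rbind(out, piece)
## binding a list of pieces once at the end: a single allocation
out <- rbindlist(frame_list)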

Related

R limit output of dataframe?

I have a data frame of transactions.
I am using dplyr to filter the transaction by gender.
Gender in my case is 0 or 1.
I want to filter 2 rows one with Gender == 0 and the second with Gender == 1.
The closest I got was to do it like this:
df %>% arrange(Gender)
and then pick 2 transactions in the middle, where one has Gender 1 and the other Gender 0.
Please advise.
To randomly sample a row where a condition on another column is satisfied, you can use sample like this:
# Dummy data: X = value of interest, G = Gender (0,1)
df1 <- data.frame("X" = rnorm(10, 0, 1), "G" = sample(c(0,1), replace = T, size = 10))
# Sampling
sample(df1[,'X'][df1[,'G'] == 1], size = 1)
sample(df1[,'X'][df1[,'G'] == 0], size = 1)
This takes one value of X for each gender (the condition on G is applied by the subset df1[,'G'] == 1).
Building on the comment by docendo discimus, you can also use the popular dplyr package with the one-liner below, but note that in my testing it runs considerably slower (about 5 times slower, with 3M rows and 1000 iterations) than the sample approach above:
pull(df1 %>% group_by(G) %>% sample_n(1), X)
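If you are on dplyr 1.0.0 or later, slice_sample() is the newer equivalent of sample_n() and reads a bit more directly:
# one random X per gender group (dplyr >= 1.0.0)
df1 %>%
  group_by(G) %>%
  slice_sample(n = 1) %>%
  pull(X)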

Calculate Pair-Wise Combinations of Factor Vector with Summed Values

I'm trying to generate network graph data from raw occurrence data. In the raw data, I have the occurrence rate of features in a variety of contexts. Let's say it's actors in different movies. Each row is [context, feature, weight], where weight might be amount of screen time. Here's a toy data set:
df <- data.frame(context = sample(LETTERS[1:10], 500, replace = TRUE),
                 feature = sample(LETTERS, 500, replace = TRUE),
                 weight  = sample(1:100, 500, replace = TRUE))
So for Movie A, we might have 20 rows, where each row is an actor's name and their screen time in that movie.
What I'd like to generate is the pairwise combination of all actors for each movie, with the sum of their respective weights. So for example, if we start with:
[A, A, 5]
[A, B, 2]
I'd like output in the format of [context, feature1, feature2, sum.weight]. So:
[A, A, B, 7]
I know how to run through this with a combination of for loops, but I'd like to know if there is a more "classic R" way of approaching this, particularly with something like data.table.
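For reference, a rough sketch of the loop-based approach (variable names here are just placeholders) could look like this:
# aggregate to one row per context/feature, then take all pairs within each context
agg <- aggregate(weight ~ context + feature, data = df, FUN = sum)
out <- list()
for (ctx in unique(agg$context)) {
  sub <- agg[agg$context == ctx, ]
  if (nrow(sub) < 2) next
  idx <- combn(nrow(sub), 2)   # all pairs of row indices within this context
  out[[ctx]] <- data.frame(context    = ctx,
                           feature1   = sub$feature[idx[1, ]],
                           feature2   = sub$feature[idx[2, ]],
                           sum.weight = sub$weight[idx[1, ]] + sub$weight[idx[2, ]])
}
res <- do.call(rbind, out)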
Here's a possible solution using the data.table package:
library(data.table)
# keep a record of feature's levels
feature.levels <- levels(df$feature)
# for each context, create a data table for all pair combinations of features,
# & sum of said pair's weights
df <- df[,
         as.data.table(cbind(t(combn(feature, 2)),
                             rowSums(t(combn(weight, 2))))),
         by = context]
# map features (converted into integers in the previous step) back to factors
df[,
   c('V1', 'V2') := lapply(.SD,
                           function(x) factor(x, labels = feature.levels)),
   .SDcols = c('V1', 'V2')]
# rename features / sum weights
setnames(df,
         old = c("V1", "V2", "V3"),
         new = c("feature1", "feature2", "sum.weights"))
> head(df)
context feature1 feature2 sum.weights
1: C j l 373
2: C j z 282
3: C j v 382
4: C j h 488
5: C j c 280
6: C j u 360
Data (I used lower case for "feature" so that it's visually distinct from upper case "context"):
set.seed(123)
df <- data.frame(context = sample(LETTERS[1:10], 500, replace = TRUE),
                 feature = sample(letters, 500, replace = TRUE),
                 weight  = sample(1:100, 500, replace = TRUE))
# convert to data table & summarize to unique combinations by context + feature
setDT(df)
df <- df[,
         list(weight = sum(weight)),
         by = list(context, feature)]

Filter by ranges supplied by two vectors, without a join operation

I wish to do exactly this: Take dates from one dataframe and filter data in another dataframe - R
except without joining, as I am afraid that after I join my data the result will be too big to fit in memory, prior to the filter.
Here is sample data:
tmp_df <- data.frame(a = 1:10)
I wish to do an operation that looks like this:
lower_bound <- c(2, 4)
upper_bound <- c(2, 5)
tmp_df %>%
  filter(a >= lower_bound & a <= upper_bound) # does not work, as <= is vectorised inappropriately
and my desired result is:
> tmp_df[(tmp_df$a <= 2 & tmp_df$a >= 2) | (tmp_df$a <= 5 & tmp_df$a >= 4), , drop = F]
# one way to get indices to subset data frame, impractical for a long range vector
a
2 2
4 4
5 5
My problem with memory requirements (with respect to the linked join solution) arises when tmp_df has many more rows and the lower_bound and upper_bound vectors have many more entries. A dplyr solution, or one that can be part of a pipe, is preferred.
Maybe you could borrow the inrange function from data.table, which checks whether each value in x is in between any of the intervals provided in lower, upper.
Usage:
inrange(x, lower, upper, incbounds=TRUE)
library(dplyr); library(data.table)
tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
# a
#1 2
#2 4
#3 5
If you'd like to stick with dplyr it has similar functionality provided through the between function.
# ranges I want to check between
my_ranges <- list(c(2,2), c(4,5), c(6,7))
tmp_df <- data.frame(a = 1:10)
tmp_df %>%
  filter(apply(bind_rows(lapply(my_ranges,
                                FUN = function(x, a) {
                                  data.frame(t(between(a, x[1], x[2])))
                                }, a)),
               2, any))
a
1 2
2 4
3 5
4 6
5 7
Just be aware that with between the interval boundaries are always included; unlike inrange, this cannot be changed.
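For a quick illustration of that difference, inrange exposes the choice through its incbounds argument; with the toy ranges above, excluding the bounds leaves no value strictly inside either interval:
# open intervals (2,2) and (4,5): no integer in 1:10 falls strictly inside, so zero rows returned
tmp_df %>% filter(inrange(a, c(2, 4), c(2, 5), incbounds = FALSE))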

Calculate Percentage and other functions using data.table

I want to apply aggregate functions and percentage function to column. I found threads that discuss aggregation (Calculating multiple aggregations with lapply(.SD, ...) in data.table R package) and threads that discuss percentage (How to obtain percentages per value for the keys in R using data.table? and Use data.table to calculate the percentage of occurrence depending on the category in another column), but not both.
Please note that I am looking for data.table-based methods, as dplyr wouldn't work on the actual data set.
Here's the code to generate sample data:
set.seed(10)
IData <- data.frame(let = sample(x = LETTERS, size = 10000, replace = TRUE),
                    numbers1 = sample(x = 1:20000, size = 10000),
                    numbers2 = sample(x = 1:20000, size = 10000))
IData$let<-as.character(IData$let)
data.table::setDT(IData)
Here's the code to generate output using dplyr
Output <- IData %>%
  dplyr::group_by(let) %>%
  dplyr::summarise(numbers1.mean = as.double(mean(numbers1)),
                   numbers1.median = as.double(median(numbers1)),
                   numbers2.mean = as.double(mean(numbers2)),
                   sum.numbers1.n = sum(numbers1)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(perc.numbers1 = sum.numbers1.n / sum(sum.numbers1.n)) %>%
  dplyr::select(let, numbers1.mean, numbers1.median, numbers2.mean, perc.numbers1)
Sample output (first rows)
If I run head(Output), I get:
let numbers1.mean numbers1.median numbers2.mean perc.numbers1
<chr> <dbl> <dbl> <dbl> <dbl>
N 10320.951 10473.0 9374.435 0.03567927
H 9683.590 9256.5 9328.035 0.03648391
L 10223.322 10226.0 9806.210 0.04005400
S 9922.486 9618.0 10233.849 0.03678742
C 9592.620 9226.0 9791.221 0.03517997
F 10323.867 10382.0 10036.561 0.03962035
Here's what I tried using data.table (unsuccessfully)
IData[, as.list(unlist(lapply(.SD, function(x)
          list(mean = mean(x), median = median(x), sum = sum(x))))),
      by = let, .SDcols = c("numbers1", "numbers2")
      ][, .(Perc = numbers1.sum / sum(numbers1.sum)), by = let]
I have 2 Questions:
a) How can I solve this using data.table?
b) I have seen above threads have used prop.table. Can someone please guide me how to use this function?
I would sincerely appreciate any guidance.
We can use a similar approach with data.table:
res <- IData[, .(numbers1.mean = mean(numbers1),
numbers1.median = median(numbers1),
numbers2.mean=mean(numbers2),
sum.numbers1.n = sum(numbers1)), let
][, perc.numbers1 := sum.numbers1.n/sum(sum.numbers1.n)
][, c("let", "numbers1.mean", "numbers1.median",
"numbers2.mean", "perc.numbers1"), with = FALSE]
head(res)
# let numbers1.mean numbers1.median numbers2.mean perc.numbers1
#1: N 10320.951 10473.0 9374.435 0.03567927
#2: H 9683.590 9256.5 9328.035 0.03648391
#3: L 10223.322 10226.0 9806.210 0.04005400
#4: S 9922.486 9618.0 10233.849 0.03678742
#5: C 9592.620 9226.0 9791.221 0.03517997
#6: F 10323.867 10382.0 10036.561 0.03962035
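Regarding your second question: for a plain numeric vector, prop.table(x) is simply x / sum(x), so the percentage step could equally be written with it, for example:
# same value as sum.numbers1.n / sum(sum.numbers1.n)
IData[, .(sum.numbers1.n = sum(numbers1)), by = let][
  , perc.numbers1 := prop.table(sum.numbers1.n)][]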

extracting data using dplyr

Say I have the following data
set.seed(123)
a <- c(rep(1,30),rep(2,30))
b <- rep(1:30)
c <- sample(20:60, 60, replace = T)
data <- data.frame(a,b,c)
data
Now I want to extract data whereby, for each unique value of a, I match rows where the b value is the same and the c value is within a limit of ±5, so that the desired output keeps only those matching rows.
You want to compare within each distinct b group (the b values are unique within each a), so you should group by b; grouping by a would not let you compare between the two a groups. A possible solution would be:
data %>%
  group_by(b) %>%
  filter(abs(diff(c)) <= 5)
With the data.table package this would be something like:
library(data.table)
setDT(data)[, .SD[abs(diff(c)) <= 5], b]
Or
data[, if (abs(diff(c)) <= 5) .SD, b]
Or
data[data[, abs(diff(c)) <= 5, b]$V1]
In base R it would be something like
data[with(data, !!ave(c, b, FUN = function(x) abs(diff(x)) <= 5)), ]
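One caveat: all of these rely on each b group containing exactly two rows (one per value of a), so that diff(c) returns a single value. If a group could ever contain more rows, comparing the overall spread of c generalises the same idea:
# keep groups whose c values all lie within 5 of one another
data %>%
  group_by(b) %>%
  filter(max(c) - min(c) <= 5)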
