Cartesian product with dplyr - r

I'm trying to find the dplyr function for the Cartesian product. I have two simple data.frames with no common variable:
x <- data.frame(x = c("a", "b", "c"))
y <- data.frame(y = c(1, 2, 3))
I would like to reproduce the result of
merge(x, y)
x y
1 a 1
2 b 1
3 c 1
4 a 2
5 b 2
6 c 2
7 a 3
8 b 3
9 c 3
I've already looked for this in other questions without finding anything useful.

Use crossing from the tidyr package:
x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
crossing(x, y)
Result:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
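For completeness: newer tidyr versions (1.0 and up) also offer expand_grid(), which builds the Cartesian product but, unlike crossing(), keeps the input order and does not de-duplicate rows. A minimal sketch, assuming tidyr >= 1.0:
library(tidyr)
# expand_grid() accepts vectors or data frames; the last argument varies fastest
expand_grid(x = c("a", "b", "c"), y = c(1, 2, 3))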

When x and y are database tbls (tbl_dbi / tbl_sql) you can now also do:
full_join(x, y, by = character())
This was added to dplyr at the end of 2017 and also gets translated to a CROSS JOIN on database backends, which saves the nastiness of having to introduce fake join variables.
Update (Nov 2022): commenters report that this also works on standard data frames.
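Note that dplyr 1.1.0 added a dedicated cross_join() verb and deprecated by = character(); a sketch assuming that version or later:
library(dplyr)
# pairs every row of x with every row of y; translated to CROSS JOIN on DB backends
cross_join(x, y)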

If we need a tidyverse output, we can use expand from tidyr:
library(tidyverse)
y %>%
  expand(y, x = x$x) %>%
  select(x, y)
# A tibble: 9 × 2
# x y
# <fctr> <dbl>
#1 a 1
#2 b 1
#3 c 1
#4 a 2
#5 b 2
#6 c 2
#7 a 3
#8 b 3
#9 c 3

When faced with this problem, I tend to do something like this:
x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
x %>%
  mutate(temp = 1) %>%
  inner_join(y %>% mutate(temp = 1), by = "temp") %>%
  dplyr::select(-temp)
If x and y are multi-column data frames and I want every combination of a row of x with a row of y, this is neater than any expand.grid() option I can come up with (see the sketch below).
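A minimal sketch of that multi-column case, with x2 and y2 made up for illustration:
x2 <- data.frame(x1 = c("a", "b"), x2 = c(10, 20))
y2 <- data.frame(y1 = c(1, 2), y2 = c("u", "v"))
x2 %>%
  mutate(temp = 1) %>%
  inner_join(y2 %>% mutate(temp = 1), by = "temp") %>%
  dplyr::select(-temp)  # 2 * 2 = 4 rows: every row of x2 paired with every row of y2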

expand.grid(x=c("a","b","c"),y=c(1,2,3))
Edit: consider also the following elegant solution from "Y T" for n more complex data.frames:
https://stackoverflow.com/a/21911221/5350791
in short:
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
expand.grid.df(df1, df2, df3)
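Here df1, df2 and df3 are placeholders; with hypothetical inputs made up to make the call runnable, merge(..., by = NULL) computes a pairwise Cartesian product and Reduce() chains it across all the data frames:
df1 <- data.frame(a = 1:2)
df2 <- data.frame(b = c("x", "y"))
df3 <- data.frame(c = c(TRUE, FALSE))
expand.grid.df(df1, df2, df3)  # 2 * 2 * 2 = 8 rows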

This is a continuation of dsz's comment. The idea comes from: http://jarrettmeyer.com/2018/07/10/cross-join-dplyr.
tbl_1$fake <- 1
tbl_2$fake <- 1
my_cross_join <- full_join(tbl_1, tbl_2, by = "fake") %>%
  select(-fake)
I tested this on four columns of data ranging in size from 4 to 640 obs, and it took about 1.08 seconds.

Comparing two of the answers above, full_join() with by = character() seems to be faster than crossing():
library(tidyverse)
library(microbenchmark)
df <- data.frame(blah = 1:10)
microbenchmark(diamonds %>% crossing(df))
Unit: milliseconds
                      expr      min       lq     mean   median       uq     max neval
 diamonds %>% crossing(df) 21.70086 22.63943 23.72622 23.01447 24.25333 30.3367   100
microbenchmark(diamonds %>% full_join(df, by = character()))
Unit: milliseconds
                                         expr      min       lq     mean   median       uq      max neval
 diamonds %>% full_join(df, by = character()) 9.814783 10.23155 10.76592 10.44343 11.18464 15.71868   100

Related

Row Minimum except certain columns

I have the data frame below. I need to find the row min and max, excluding the few columns that are characters.
df
x y z
1 1 1 a
2 2 5 b
3 7 4 c
I need
df
x y z Min Max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
Another dplyr possibility could be:
df %>%
  mutate(Max = do.call(pmax, select_if(., is.numeric)),
         Min = do.call(pmin, select_if(., is.numeric)))
x y z Max Min
1 1 1 a 1 1
2 2 5 b 5 2
3 7 4 c 7 4
Or a variation proposed by @G. Grothendieck:
df %>%
  mutate(Min = pmin(!!!select_if(., is.numeric)),
         Max = pmax(!!!select_if(., is.numeric)))
Another base R solution: subset only the numeric columns, then use apply on each row to get the minimum and maximum values with range.
cbind(df, t(apply(df[sapply(df, is.numeric)], 1, function(x)
  setNames(range(x, na.rm = TRUE), c("min", "max")))))
# x y z min max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
1) This one-liner uses no packages:
transform(df, min = pmin(x, y), max = pmax(x, y))
giving:
x y z min max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
2) If you have many columns and don't want to list them all, or don't want to determine yourself which are numeric, then this also uses no packages.
ix <- sapply(df, is.numeric)
transform(df, min = apply(df[ix], 1, min), max = apply(df[ix], 1, max))
If your actual data has NAs and if you want to ignore them when taking the min or max then min, max, pmin and pmax all take an optional na.rm = TRUE argument.
Note
Lines <- "x y z
1 1 1 a
2 2 5 b
3 7 4 c"
df <- read.table(text = Lines)
1) We can use select_if to select the columns that are numeric, then get the row-wise min and max with pmin and pmax, and bind the result to the original dataset:
library(dplyr)
library(purrr)
df %>%
  select_if(is.numeric) %>%
  transmute(Min = reduce(., pmin, na.rm = TRUE),
            Max = reduce(., pmax, na.rm = TRUE)) %>%
  bind_cols(df, .)
# x y z Min Max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
NOTE: here, we use only a single select_if call.
2) The same can be done in base R (no packages used)
i1 <- names(which(sapply(df, is.numeric)))
df['Min'] <- do.call(pmin, c(df[i1], na.rm = TRUE))
df['Max'] <- do.call(pmax, c(df[i1], na.rm = TRUE))
Also, as stated in the comments, this is the generalized option. If there are only two columns, just doing pmin(x, y) or pmax(x, y) is possible, but that neither checks whether the columns are numeric nor generalizes to more columns.
NOTE: all of the solutions mentioned here were either answered first or come from the comment exchange with the OP.
data
df <- structure(list(x = c(1L, 2L, 7L), y = c(1L, 5L, 4L), z = c("a",
"b", "c")), class = "data.frame", row.names = c("1", "2", "3"
))
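For newer dplyr versions (1.0+), the select_if approach can also be written with across(); a sketch of the same generalized idea under that assumption:
library(dplyr)
# across(where(is.numeric)) collects the numeric columns;
# pmin/pmax then compute the row-wise extremes
df %>%
  mutate(Min = do.call(pmin, c(across(where(is.numeric)), na.rm = TRUE)),
         Max = do.call(pmax, c(across(where(is.numeric)), na.rm = TRUE)))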

Counting the number of zeros within a "melted" data frame

Hi, I'm learning R and I'm trying to count how many zeros there are within melted data. That is, I want to know how many zeros correspond to columns b and c, and print the two results out.
I generated an example:
library(reshape)
library(plyr)
library(dplyr)
id = c(1,2,3,4,5,6,7,8,9,10)
b = c(0,0,5,6,3,7,2,8,1,8)
c = c(0,4,9,87,0,87,0,4,5,0)
test = data.frame(id,b,c)
test_melt = melt(test, id.vars = "id")
test_melt
I imagine I should create an if statement for this, something like
if (test$value == 0) { print() }, but how can I tell R to count zeros for columns that have been melted?
With your data:
test_melt %>%
  group_by(variable) %>%
  summarize(zeroes = sum(value == 0))
# # A tibble: 2 x 2
# variable zeroes
# <fctr> <int>
# 1 b 2
# 2 c 4
Base R:
aggregate(test_melt$value, by = list(variable = test_melt$variable),
          FUN = function(x) sum(x == 0))
# variable x
# 1 b 2
# 2 c 4
... and out of curiosity:
library(microbenchmark)
microbenchmark(
  dplyr = group_by(test_melt, variable) %>% summarize(zeroes = sum(value == 0)),
  base1 = aggregate(test_melt$value, by = list(variable = test_melt$variable), FUN = function(x) sum(x == 0)),
  # @PankajKaundal's suggested "formula" notation reads easier
  base2 = aggregate(value ~ variable, test_melt, function(x) sum(x == 0))
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 916.421 986.985 1069.7000 1022.1760 1094.7460 2272.636 100
# base1 647.658 682.302 783.2065 715.3045 765.9940 1905.411 100
# base2 813.219 867.737 950.3247 897.0930 959.8175 2017.001 100
If you only need the total number of zeros across all melted columns, this one-liner should do it:
sum(test_melt$value == 0)
This might help. Is this what you're looking for?
> test_melt[4] <- 1
> test_melt2 <- aggregate(V4 ~ value + variable, test_melt, sum)
> test_melt2
value variable V4
1 0 b 2
2 1 b 1
3 2 b 1
4 3 b 1
5 5 b 1
6 6 b 1
7 7 b 1
8 8 b 2
9 0 c 4
10 4 c 2
11 5 c 1
12 9 c 1
13 87 c 2
V4 is the count
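Yet another compact base R option, a sketch using the same test_melt as above: tapply() applies the zero count within each level of variable.
tapply(test_melt$value == 0, test_melt$variable, sum)
# b c
# 2 4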

Tidyr how to spread into count of occurrence [duplicate]

This question already has answers here:
How do I get a contingency table?
Faster ways to calculate frequencies and cast from long to wide
I have a data frame like this:
other=data.frame(name=c("a","b","a","c","d"),result=c("Y","N","Y","Y","N"))
How can I use the spread function in tidyr, or some other function, to get the counts of result Y and N as column headers, like this:
name Y N
a 2 0
b 0 1
Thanks
Here are a few of the many ways to go about it:
1) With the dplyr library, you can simply group and count into the format needed:
library(dplyr)
other %>% group_by(name) %>% summarise(N = sum(result == 'N'), Y = sum(result == 'Y'))
Source: local data frame [4 x 3]
name N Y
<fctr> <int> <int>
1 a 0 2
2 b 1 0
3 c 0 1
4 d 1 0
2) You can use a combination of table and tidyr spread as follows:
library(tidyr)
spread(as.data.frame(table(other)), result, Freq)
name N Y
1 a 0 2
2 b 1 0
3 c 0 1
4 d 1 0
3) You can use a combination of dplyr and tidyr to do as follows:
library(dplyr)
library(tidyr)
spread(count(other, name, result), result, n, fill = 0)
Source: local data frame [4 x 3]
Groups: name [4]
name N Y
<fctr> <dbl> <dbl>
1 a 0 2
2 b 1 0
3 c 0 1
4 d 1 0
Here is another option using dcast from data.table
library(data.table)
dcast(setDT(other), name~result, length)
# name N Y
#1: a 0 2
#2: b 1 0
#3: c 0 1
#4: d 1 0
Although table(other) would be a compact option (from @mtoto's comments), for large datasets it may be more efficient to use dcast. Some benchmarks are given below.
set.seed(24)
other1 <- data.frame(name = sample(letters, 1e6, replace = TRUE),
                     result = sample(c("Y", "N"), 1e6, replace = TRUE),
                     stringsAsFactors = FALSE)
other2 <- copy(other1)
gopala1 <- function() other1 %>%
  group_by(name) %>%
  summarise(N = sum(result == 'N'), Y = sum(result == 'Y'))
gopala2 <- function() spread(as.data.frame(table(other1)), result, Freq)
gopala3 <- function() spread(count(other1, name, result), result, n, fill = 0)
akrun <- function() dcast(as.data.table(other2), name~result, length)
library(microbenchmark)
microbenchmark(gopala1(), gopala2(), gopala3(),
               akrun(), unit = 'relative', times = 20L)
# expr min lq mean median uq max neval
# gopala1() 2.710561 2.331915 2.142183 2.325167 2.134399 1.513725 20
# gopala2() 2.859464 2.564126 2.531130 2.683804 2.720833 1.982760 20
# gopala3() 2.345062 2.076400 1.953136 2.027599 1.882079 1.947759 20
# akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
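In current tidyr, spread() is superseded by pivot_wider(); a sketch of the same count-and-widen idea, assuming tidyr >= 1.0:
library(dplyr)
library(tidyr)
other %>%
  count(name, result) %>%
  pivot_wider(names_from = result, values_from = n, values_fill = 0)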

grouped operations that result in length not equal to 1 or length of group in dplyr

I'm not sure which function to use to do the following:
library(data.table)
dt = data.table(a = 1:4, b = 1:2)
dt[, rep(a[1], 3), by = b]
# b V1
#1: 1 1
#2: 1 1
#3: 1 1
#4: 2 2
#5: 2 2
#6: 2 2
Both summarise and mutate are unhappy with this length:
library(dplyr)
df = data.frame(a = 1:4, b = 1:2)
df %.% group_by(b) %.% summarise(rep(a[1], 3))
#Error: expecting a single value
df %.% group_by(b) %.% mutate(rep(a[1], 3))
#Error: incompatible size (3), expecting 2 (the group size) or 1
In dplyr version 0.2 you could do this using the do operator:
> df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
#Source: local data frame [6 x 2]
#Groups: b
#
# b a
#1 1 1
#2 1 1
#3 1 1
#4 2 2
#5 2 2
#6 2 2
While @beginneR's answer does work, it doesn't seem to be a real substitute for the data.table behavior. Consider:
df <- data.frame(a = 1, b = rep(1:1e4, 2))
dt <- data.table(df)
microbenchmark(times = 5,
  dt[, rep(a[1], 3), by = b],
  df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
)
This shows the dplyr implementation to be over 200x slower:
Unit: milliseconds
expr min lq median uq
dt[, rep(a[1], 3), by = b] 13.14318 13.70248 14.60524 15.26676
df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3))) 3269.40731 3359.11614 3583.19430 3736.67162
Maybe there is a better way to do this with do that doesn't require calling data.frame inside each do()? Also, the syntax is a bit involved for something that is very simple in data.table.
Otherwise, as per Hadley's issue link, it seems this is expected to be implemented in dplyr 3.1, which looks to be the next release.
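For reference, recent dplyr versions handle this directly: dplyr 1.1.0 added reframe() for grouped results of any length (summarise() also accepted multi-row results from 1.0 onwards, but now warns). A sketch assuming dplyr >= 1.1.0:
library(dplyr)
df <- data.frame(a = 1:4, b = 1:2)
# one result row per repetition, three per group
df %>% group_by(b) %>% reframe(a = rep(a[1], 3))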

Combine tables of aggregated values with summarised variables from 'parent' data set

I have a data set along these lines:
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 type = c("C", "C", "C", "H", "H", "H", "C", "C", "C"),
                 country = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
I want to aggregate df$vals and bring the other variables through as well.
At the moment I'm doing it like this:
multi.func <- function(x){
  c(
    n = length(x),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE)
  )
}
aggVals <- as.data.frame(do.call(rbind, by(df$vals, df$sp, FUN = multi.func, simplify = TRUE)))
aggVals$sp <- row.names(aggVals)
aggDescrip <- aggregate(cbind(as.character(type), as.character(country)) ~ sp, data = df, FUN = unique)
result <- merge(aggDescrip, aggVals)
This works well enough but I wondered if there's an easier way.
Thanks
Perhaps you should look into the data.table package.
library(data.table)
DT <- data.table(df, key="sp")
DT[, list(type = unique(as.character(type)),
          country = unique(as.character(country)),
          n = .N, min = min(vals), max = max(vals),
          mean = mean(vals)), by = key(DT)]
# sp type country n min max mean
# 1: 100 C A 3 1 3 2
# 2: 101 H B 3 4 6 5
# 3: 102 C C 3 7 9 8
If you want to stick with base R, here is another approach that might be of use (though aggregate is probably more common):
unique(within(df, {
  mean <- ave(vals, sp, FUN = mean)
  max <- ave(vals, sp, FUN = max)
  min <- ave(vals, sp, FUN = min)
  n <- ave(vals, sp, FUN = length)
  rm(vals)
}))
# sp type country n min max mean
# 1 100 C A 3 1 3 2
# 4 101 H B 3 4 6 5
# 7 102 C C 3 7 9 8
Update: A variation on your initial attempt
I would suggest sticking with data.table if possible, because the resulting code is easy to follow and the process of aggregation is quick.
However, with a little bit of modification, you can have (yet another) base R approach that is somewhat more direct.
First, modify your function so that instead of using c(), use data.frame. Also, add an argument that specifies which column needs to be aggregated.
multi.func <- function(x, value_column) {
  data.frame(
    n = length(x[[value_column]]),
    min = min(x[[value_column]], na.rm = TRUE),
    max = max(x[[value_column]], na.rm = TRUE),
    mean = mean(x[[value_column]], na.rm = TRUE))
}
Second, use lapply on your dataset, split up by your grouping variable, merge the output with your original dataset, and return the unique values.
unique(merge(df[-4],
             do.call(rbind, lapply(split(df, df$sp),
                                   multi.func, value_column = "vals")),
             by.x = "sp", by.y = "row.names"))
Using just aggregate:
result <- aggregate(vals ~ type + sp + country, df,
                    function(x) c(length(x), min(x), max(x), mean(x)))
result
type sp country vals.1 vals.2 vals.3 vals.4
1 C 100 A 3 1 3 2
2 H 101 B 3 4 6 5
3 C 102 C 3 7 9 8
colnames(result)
[1] "type" "sp" "country" "vals"
The above creates a single matrix-valued "vals" column rather than four separate columns. summaryBy from the doBy package is similar to aggregate but allows an output with multiple columns:
library(doBy)
result <- summaryBy(vals ~ type + sp + country, df,
                    FUN = function(x) c(n = length(x), min = min(x), max = max(x), mean = mean(x)))
result
type sp country vals.n vals.min vals.max vals.mean
1 C 100 A 3 1 3 2
2 C 102 C 3 7 9 8
3 H 101 B 3 4 6 5
colnames(result)
[1] "type" "sp" "country" "vals.n" "vals.min" "vals.max"
[7] "vals.mean"
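For comparison, a plain dplyr version of the same aggregation, sketched against the df defined in the question (the .groups argument assumes dplyr >= 1.0):
library(dplyr)
df %>%
  group_by(sp, type, country) %>%
  summarise(n = n(), min = min(vals), max = max(vals),
            mean = mean(vals), .groups = "drop")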
