I have a DF like this:
ID Product
1 1
1 2
2 2
2 3
2 4
3 1
3 2
4 3
Now I need a way in R to do something like this:
finding products that customers bought together
My output should look like this:
ProductX ProductY Times
1 2 2
2 3 1
3 4 1
A matrix output like this would also be nice:
[,1] [,2] [,3] [,4]
[1,] 1 2 0 0
[2,] 2 1 1 0
[3,] 0 0 1 1
[4,] 0 0 1 1
I have tried it with the reshape2 package, but I don't know how to get this output format.
Here's an answer using data.table and a non-equi self-join. Edit: added allow.cartesian = TRUE so that it also works with large datasets.
library(data.table)
dt[dt
, on = .(ID = ID, Product < Product)
, .(ProductX = x.Product, ProductY = i.Product)
, nomatch = 0L
, allow.cartesian = TRUE
][, .N, by = .(ProductX, ProductY)]
ProductX ProductY N
1: 1 2 2
2: 2 3 1
3: 2 4 1
4: 3 4 1
Dplyr equivalent:
library(dplyr)
inner_join(tib, tib, by = 'ID')%>%
filter(Product.x < Product.y)%>%
count(Product.x, Product.y)
Product.x Product.y n
<dbl> <dbl> <int>
1 1 2 2
2 2 3 1
3 2 4 1
4 3 4 1
And here's a base R version as well:
aggregate(ID ~ Product.x + Product.y
, data = merge(df, df, by = 'ID')
, subset = Product.x < Product.y
, FUN = length)
# Need to change the names from ID to n
Product.x Product.y ID
1 1 2 2
2 2 3 1
3 2 4 1
4 3 4 1
Performance:
Unit: milliseconds
expr min lq mean median uq max neval
dt_way 3.9149 4.29330 4.593209 4.6597 4.80210 6.2326 100
dplyr_inner_join 1.8218 1.91510 2.058864 2.0572 2.16205 3.0157 100
dplyr_tidyr 13.8107 14.15735 16.020262 14.3571 14.78975 127.9654 100
base_agg 2.3393 2.51215 2.586652 2.5804 2.63865 3.4415 100
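These timings were presumably produced with microbenchmark; a sketch along the following lines (the wrapped expressions are assumed to be the ones shown above, and dplyr_tidyr corresponds to the expand() answer further down, omitted here) should reproduce something similar:
library(microbenchmark)
microbenchmark(
  dt_way = dt[dt
              , on = .(ID = ID, Product < Product)
              , .(ProductX = x.Product, ProductY = i.Product)
              , nomatch = 0L
              , allow.cartesian = TRUE
              ][, .N, by = .(ProductX, ProductY)],
  dplyr_inner_join = inner_join(tib, tib, by = 'ID') %>%
    filter(Product.x < Product.y) %>%
    count(Product.x, Product.y),
  base_agg = aggregate(ID ~ Product.x + Product.y
                       , data = merge(df, df, by = 'ID')
                       , subset = Product.x < Product.y
                       , FUN = length),
  times = 100
)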
For a larger dataset:
n_IDs <- 1E3
n_Sims <- 1E5
ID_big <- sample(1:n_IDs, n_Sims, replace = TRUE)
Product_big <- sample(1:n_Sims, n_Sims, replace = TRUE)
Unit: seconds
expr min lq mean median uq max neval
dt_way 1.633111 1.904460 1.998192 1.986452 2.110937 2.308671 10
dplyr_inner_join 5.606322 6.361026 6.574015 6.606423 6.839273 7.198770 10
dplyr_tidyr 8.385418 9.350730 10.127512 10.372830 10.675809 11.462403 10
Data:
ID <- c(1,1,2,2,2,3,3,4)
Product <- c(1,2,2,3,4,1,2,3)
dt <- data.table(ID, Product)
tib <- tibble(ID, Product)
df <- data.frame(ID, Product)
A different dplyr and tidyr possibility could be:
df %>%
group_by(ID) %>%
expand(Product, Product) %>%
filter(Product < Product1) %>%
ungroup() %>%
count(Product, Product1)
Product Product1 n
<int> <int> <int>
1 1 2 2
2 2 3 1
3 2 4 1
4 3 4 1
However, I'm not sure what the expected output is for IDs that bought the same pair of products together on multiple occasions (if such a scenario can occur).
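If such repeats should be counted only once per ID, one hedged option is to drop duplicate ID/Product rows before the self-join (dd is just a helper name used here for illustration):
library(dplyr)
dd <- distinct(df, ID, Product)   # keep each ID/Product combination once
inner_join(dd, dd, by = 'ID') %>%
  filter(Product.x < Product.y) %>%
  count(Product.x, Product.y)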
Here is one option using dplyr and tidyr. We group_by ID and create a list of combinations, taking a pair of Products at a time. For IDs that have only one Product we remove the NA values, and finally we count the combinations of Products.
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
summarise(new_col = if (n() == 1) list(as.character(Product)) else
list(combn(sort(Product), 2, paste0, collapse = ","))) %>%
unnest() %>%
separate(new_col, c("ProductX", "ProductY"), sep = ",", fill = "right") %>%
na.omit %>%
count(ProductX, ProductY)
# A tibble: 4 x 3
# ProductX ProductY n
# <chr> <chr> <int>
#1 1 2 2
#2 2 3 1
#3 2 4 1
#4 3 4 1
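The question also asks for a matrix form of the output, which none of the answers above covers. A minimal base R sketch would be to cross-tabulate customers against products and take the cross-product; this is not exactly the matrix shown in the question (the diagonal here counts how many customers bought each product), but the off-diagonal entries are the pair counts:
# incidence table of ID x Product, then product-by-product co-occurrence counts
m <- crossprod(table(df$ID, df$Product))
m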
Related
I have a data frame like this
x1<- c(0,1,1,1,1,0)
df<-data.frame(x1)
I want to add another column that takes the sum of every two rows and stores that value in both of those rows. The result should look like this:
x1 x2
0  1
1  1
1  2
1  2
1  1
0  1
You can see that the sum of the first two rows is 1, and that value appears in the first two rows of the new column (x2). Next, the sum of the third and fourth rows appears in the third and fourth rows of the new column. Can anyone help?
You can define the groups using floor division and then simply obtain the grouped sum:
library(dplyr)
df %>%
mutate(group = (row_number() - 1) %/% 2) %>%
group_by(group) %>%
mutate(x2 = sum(x1)) %>%
ungroup() %>%
select(-group)
# # A tibble: 6 × 2
# x1 x2
# <dbl> <dbl>
# 1 0 1
# 2 1 1
# 3 1 2
# 4 1 2
# 5 1 1
# 6 0 1
Here is a way using dplyr where I create an auxiliary column to group by:
library(dplyr)
x1<- c(0,1,1,1,1,0)
df <- data.frame(x1)
len_df <- nrow(df)
aux <- rep(seq(1:(len_df/2)),each = 2)[1:len_df]
df %>%
mutate(aux = aux) %>%
group_by(aux) %>%
mutate(x2 = sum(x1)) %>%
ungroup() %>%
select(-aux)
# A tibble: 6 x 2
x1 x2
<dbl> <dbl>
1 0 1
2 1 1
3 1 2
4 1 2
5 1 1
6 0 1
Create an index with gl for every 2 rows and do the sum after grouping
library(dplyr)
df <- df %>%
group_by(grp = as.integer(gl(n(), 2, n()))) %>%
mutate(x2 = sum(x1)) %>%
ungroup %>%
select(-grp)
-output
df
# A tibble: 6 × 2
x1 x2
<dbl> <dbl>
1 0 1
2 1 1
3 1 2
4 1 2
5 1 1
6 0 1
Or using collapse/data.table
library(data.table)
library(collapse)
setDT(df)[, x2 := fsum(x1, g = rep(.I, each = 2, length.out = .N), TRA = 1)]
-output
> df
x1 x2
<num> <num>
1: 0 1
2: 1 1
3: 1 2
4: 1 2
5: 1 1
6: 0 1
You can use ave + ceiling (both are base R functions); ave defaults to the group mean, so multiplying by 2 turns each pair's mean back into its sum:
> transform(df, x2 = ave(x1, ceiling(seq_along(x1) / 2)) * 2)
x1 x2
1 0 1
2 1 1
3 1 2
4 1 2
5 1 1
6 0 1
First, a way of making the data.frame without the intermediate variable.
This splits the data.frame into groups of 2, sums, then repeats the pattern into the new variable.
df<-data.frame(x1=c(0,1,1,1,1,0))
df$x2 <- rep(sapply(split(df, rep(1:3, each = 2)), sum), each = 2)
# x1 x2
#1 0 1
#2 1 1
#3 1 2
#4 1 2
#5 1 1
#6 0 1
In base R you could do:
transform(df,x2 = ave(x1, gl(nrow(df)/2, 2), FUN = sum))
x1 x2
1 0 1
2 1 1
3 1 2
4 1 2
5 1 1
6 0 1
A few more options with select benchmarks.
x1 <- sample(0:1, 1e4, 1)
microbenchmark::microbenchmark(
matrix = rep(colSums(matrix(x1, 2)), each = 2),
recycle = x1 + x1[seq(x1) + c(1, -1)],
cumsum = rep(diff(cumsum(c(0, x1))[seq(1, length(x1) + 1, 2)]), each = 2),
Thomas = ave(x1, ceiling(seq_along(x1)/2))*2,
onyambu = ave(x1, gl(length(x1)/2, 2), FUN = sum),
check = "equal"
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> matrix 65.001 69.6510 79.27203 78.4510 82.1510 148.501 100
#> recycle 95.001 100.6505 108.65003 107.5510 110.6010 176.901 100
#> cumsum 137.201 148.9010 169.61090 166.5505 177.7015 340.002 100
#> Thomas 24645.401 25297.2010 26450.46994 25963.3515 27463.2010 31803.101 100
#> onyambu 3774.902 3935.7510 4444.36500 4094.3520 4336.1505 11070.301 100
With data.table for large data:
library(data.table)
library(collapse)
x1 <- sample(0:1, 1e6, 1)
df <- data.frame(x1)
microbenchmark::microbenchmark(
matrix = setDT(df)[, x2 := rep(colSums(matrix(x1, 2)), each = 2)],
recycle = setDT(df)[, x2 := x1 + x1[.I + c(1, -1)]],
akrun = setDT(df)[, x2 := fsum(x1, g = rep(.I, each = 2, length.out = .N), TRA = 1)],
check = "identical"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> matrix 8.053302 8.937301 10.64786 9.376551 12.51890 17.2037 100
#> recycle 12.117101 12.965950 16.57696 14.003151 17.09805 56.4729 100
#> akrun 10.071701 10.611051 14.42578 11.291601 14.79090 55.1141 100
I have two data frames, df1 and df2. They both have a column 'ID'. For each row in df1, I would like to find out how many duplicates of its ID there are in df2 and add that count to the row. If there are no duplicates, the count should be 0.
df1:
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1_234 1 1
# 2 1_235 1 2
# 3 2_222 1 1
# 4 2_654 1 2
df2:
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1_234 1 1
# 2 1_235 1 2
# 3 1_234 1 1
# 4 3_234 1 2
Using dplyr:
Your data:
df1 <- data.frame(ID = c("1_234","1_235","2_222","2_654"),
a = c(1,1,1,1),
b = c(1,2,1,2))
df2 <- data.frame(ID = c("1_234","1_235","1_234","3_235"),
a = c(1,1,1,1),
b = c(1,2,1,2))
Edit: considering only the IDs:
library(dplyr)
output <- left_join(df1,
as.data.frame(table(df2$ID)),
by = c("ID" = "Var1")) %>%
mutate(Freq = ifelse(is.na(Freq), 0, Freq))
Output:
ID a b Freq
1 1_234 1 1 2
2 1_235 1 2 1
3 2_222 1 1 0
4 2_654 1 2 0
A base R option using subset + aggregate
subset(
aggregate(
n ~ .,
rbind(
cbind(df1, n = 1),
cbind(df2, n = 1)
), function(x) length(x) - 1
), ID %in% df1$ID
)
gives
ID a b n
1 1_234 1 1 2
2 2_222 1 1 0
3 1_235 1 2 1
4 2_654 1 2 0
I think you can do it with a simple sapply() and base R (no extra packages).
df1$count <- sapply(df1$ID, function(x) sum(df2$ID == x))
We may also use outer
df1$count <- rowSums(outer(df1$ID, df2$ID, FUN = `==`))
df1$count
[1] 2 1 0 0
We could use semi_join and n() to get the count of duplicates:
library(dplyr)
df1 %>%
semi_join(df2, by="ID") %>%
summarise(duplicates_df1_df2 = n())
Output:
duplicates_df1_df2
1 2
For the data frame below, there are 59 columns:
circleid name birthday 56 more...
1 1 1
2 2 10
2 5 68
2 1 10
1 1 1
Result I want
circleid distinct_name distinct_birthday 56 more...
1 1 1
2 3 2
quiz <- read.csv("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)
So far
ddply(quiz,~circleid,summarise,number_of_distinct_name=length(unique(name)))
This works for one column; how do I get it for the full data frame? Here is my attempt with a loop:
columns <- colnames(quiz)
for (i in c(1:58))
{
  final <- ddply(quiz, ~circleid, summarise, number_of_distinct_name = length(unique(columns[i])))
}
With data.table you can run:
library(data.table)
quiz <- fread("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)
unique_vals <- quiz[, lapply(.SD, uniqueN), by = circleid]
You can use dplyr:
library(dplyr)
result <- quiz %>%
  group_by(circleid) %>%
  summarise_all(n_distinct)
A microbenchmark comparing data.table and dplyr:
library(microbenchmark)
microbenchmark(x1 = quiz[, lapply(.SD, function(x) length(unique(x))), by = circleid],
x2=quiz%>%
group_by(circleid)%>%
summarise_all(n_distinct),times=100)
Unit: milliseconds
expr min lq mean median uq max neval cld
x1 150.06392 155.02227 158.75775 156.49328 158.38887 224.22590 100 b
x2 41.07139 41.90953 42.95186 42.54135 43.97387 49.91495 100 a
With package dplyr this is simple. The original answer had length(unique(.)), but @akrun pointed me to n_distinct(.) in a comment.
library(dplyr)
quiz %>%
group_by(circleid) %>%
summarise_all(n_distinct)
## A tibble: 2 x 3
#  circleid  name birthday
#     <int> <int>    <int>
#1        1     1        1
#2        2     3        2
Data.
quiz <- read.table(text = "
circleid name birthday
1 1 1
2 2 10
2 5 68
2 1 10
1 1 1
", header = TRUE)
This question should have a simple, elegant solution but I can't figure it out, so here it goes:
Let's say I have the following dataset and I want to count the number of 2s present in each row using dplyr.
set.seed(1)
ID <- LETTERS[1:5]
X1 <- sample(1:5, 5,T)
X2 <- sample(1:5, 5,T)
X3 <- sample(1:5, 5,T)
df <- data.frame(ID,X1,X2,X3)
library(dplyr)
Now, the following works:
df %>%
rowwise %>%
mutate(numtwos = sum(c(X1,X2,X3) == 2))
But how do I avoid typing out all of the column names?
I know this is probably easier to do without dplyr, but more generally I want to know how I can use dplyr's mutate with multiple columns without typing out all the column names.
Try rowSums:
> set.seed(1)
> ID <- LETTERS[1:5]
> X1 <- sample(1:5, 5,T)
> X2 <- sample(1:5, 5,T)
> X3 <- sample(1:5, 5,T)
> df <- data.frame(ID,X1,X2,X3)
> df
ID X1 X2 X3
1 A 2 5 2
2 B 2 5 1
3 C 3 4 4
4 D 5 4 2
5 E 2 1 4
> rowSums(df == 2)
[1] 2 1 0 1 1
Alternatively, with dplyr:
> df %>% mutate(numtwos = rowSums(. == 2))
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1
Here's another alternative using purrr:
library(purrr)
df %>%
by_row(function(x) {
sum(x[-1] == 2) },
.to = "numtwos",
.collate = "cols"
)
Which gives:
#Source: local data frame [5 x 5]
#
# ID X1 X2 X3 numtwos
# <fctr> <int> <int> <int> <int>
#1 A 2 5 2 2
#2 B 2 5 1 1
#3 C 3 4 4 0
#4 D 5 4 2 1
#5 E 2 1 4 1
As mentioned in the NEWS, row-based functionals are still maturing in dplyr:
We are still figuring out what belongs in dplyr and what belongs in
purrr. Expect much experimentation and many changes with these
functions.
Benchmark
We can see how rowwise() and do() compare to purrr::by_row() for this type of problem and how they "perform" against rowSums() and the tidy data way:
largedf <- df[rep(seq_len(nrow(df)), 10e3), ]
library(microbenchmark)
microbenchmark(
steven = largedf %>%
by_row(function(x) {
sum(x[-1] == 2) },
.to = "numtwos",
.collate = "cols"),
psidom = largedf %>%
rowwise %>%
do(data_frame(numtwos = sum(.[-1] == 2))) %>%
cbind(largedf, .),
gopala = largedf %>%
gather(key, value, -ID) %>%
group_by(ID) %>%
summarise(numtwos = sum(value == 2)) %>%
inner_join(largedf, .),
evan = largedf %>%
mutate(numtwos = rowSums(. == 2)),
times = 10L,
unit = "relative"
)
Results:
#Unit: relative
# expr min lq mean median uq max neval cld
# steven 1225.190659 1261.466936 1267.737126 1227.762573 1276.07977 1339.841636 10 b
# psidom 3677.603240 3759.402212 3726.891458 3678.717170 3728.78828 3777.425492 10 c
# gopala 2.715005 2.684599 2.638425 2.612631 2.59827 2.572972 10 a
# evan 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 10 a
Just wanted to add to @evan.oman's answer in case you only want to sum over specific columns, not all of them. You can use the regular select and/or select_helpers functions. In this example, we don't want to include X1 in rowSums:
df %>%
mutate(numtwos = rowSums(select(., -X1) == 2))
ID X1 X2 X3 numtwos
1 A 2 5 2 1
2 B 2 5 1 0
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 0
One approach is to use a combination of dplyr and tidyr to convert data into long format, and do the computation:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -ID) %>%
group_by(ID) %>%
summarise(numtwos = sum(value == 2)) %>%
inner_join(df, .)
Output is as follows:
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1
You can use do, which doesn't add the column to your original data frame; you need to bind it back to the original data frame yourself.
df %>%
rowwise %>%
do(numtwos = sum(.[-1] == 2)) %>%
data.frame
numtwos
1 2
2 1
3 0
4 1
5 1
Add a cbind to bind the new column to the original data frame:
df %>%
rowwise %>%
do(numtwos = sum(.[-1] == 2)) %>%
data.frame %>% cbind(df, .)
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1
The data set contains three variables: id, sex, and grade (factor).
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
For each ID, I need to see how many unique grades we have and then create a new column (call it N) to record the grade frequency. For instance, for ID=1, we have five unique values for "grade", so N = 5; for ID=2, we have two unique values for "grade", so N = 2; for ID=4, we have two unique values for "grade" (ignoring NA), so N = 2.
The final data set is
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
mydata$N <- c(5,5,5,5,5,2,2,2,2,1,1,1,1,2,2,2,2,2)
New answer:
The uniqueN-function of data.table has a na.rm argument, which we can use as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade, na.rm = TRUE), by = id]
which gives:
> mydata
id sex grade n
1: 1 1 a 5
2: 1 1 b 5
3: 1 1 c 5
4: 1 1 d 5
5: 1 1 e 5
6: 2 0 x 2
7: 2 0 y 2
8: 2 0 y 2
9: 2 0 x 2
10: 3 0 q 1
11: 3 0 q 1
12: 3 0 q 1
13: 3 0 q 1
14: 4 1 a 2
15: 4 1 a 2
16: 4 1 a 2
17: 4 1 NA 2
18: 4 1 b 2
Old answer:
With data.table you could do this as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade[!is.na(grade)]), by = id]
or:
setDT(mydata)[, n := uniqueN(na.omit(grade)), by = id]
You could use the package data.table:
library(data.table)
setDT(mydata)
#I have removed NA's, up to you how to count them
mydata[,N_u:=length(unique(grade[!is.na(grade)])),by=id]
Very short, readable and fast. It can also be done in base-R:
#lapply(split(grade,id),...: splits data into subsets by id
#unlist: creates one vector out of multiple vectors
#rep: makes sure each ID is repeated enough times
mydata$N <- unlist(lapply(split(mydata$grade,mydata$id),function(x){
rep(length(unique(x[!is.na(x)])),length(x))
}
))
Because there was discussion on what is faster, let's do some benchmarking.
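The test1/test2 objects are not shown; a sketch of how they might have been created (assuming the two data.table approaches above are the ones being compared) is:
library(microbenchmark)
test1 <- microbenchmark(
  length_unique = mydata[, N_u := length(unique(grade[!is.na(grade)])), by = id],
  uniqueN       = mydata[, n := uniqueN(grade, na.rm = TRUE), by = id],
  times = 100
)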
Given dataset:
> test1
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 3.043186 3.161732 3.422327 3.286436 3.477854 10.627030 100 b
uniqueN 2.481761 2.615190 2.763192 2.738354 2.872809 3.985393 100 a
Larger dataset: (10000 observations, 1000 id's)
> test2
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 11.84123 24.47122 37.09234 30.34923 47.55632 97.63648 100 a
uniqueN 25.83680 50.70009 73.78757 62.33655 97.33934 210.97743 100 b
A dplyr option that makes use of dplyr::n_distinct and its na.rm-argument:
library(dplyr)
mydata %>% group_by(id) %>% mutate(N = n_distinct(grade, na.rm = TRUE))
#Source: local data frame [18 x 4]
#Groups: id [4]
#
# id sex grade N
# (dbl) (dbl) (fctr) (int)
#1 1 1 a 5
#2 1 1 b 5
#3 1 1 c 5
#4 1 1 d 5
#5 1 1 e 5
#6 2 0 x 2
#7 2 0 y 2
#8 2 0 y 2
#9 2 0 x 2
#10 3 0 q 1
#11 3 0 q 1
#12 3 0 q 1
#13 3 0 q 1
#14 4 1 a 2
#15 4 1 a 2
#16 4 1 a 2
#17 4 1 NA 2
#18 4 1 b 2
Looks like we have several votes for data.table, but you could also use the base R function ave():
mydata$N <- ave(as.character(mydata$grade),mydata$id,
FUN = function(x) length(unique(x[!is.na(x)])))
Use tapply and a lookup table:
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4),
sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q",
"q","q","q", "a", "a", "a", NA, "b"))
uniqN <- tapply(mydata$grade, mydata$id, function(x) sum(!is.na(unique(x))))
mydata$N <- uniqN[mydata$id]
Here is a dplyr method. I kept the summary table separate for tidy reasons.
library(dplyr)
summary =
mydata %>%
distinct(id, grade) %>%
filter(grade %>% is.na %>% `!`) %>%
count(id)
mydata %>%
left_join(summary)