Tidyr how to spread into count of occurrence [duplicate]

This question already has answers here:
How do I get a contingency table?
(6 answers)
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
I have a data frame like this:
other=data.frame(name=c("a","b","a","c","d"),result=c("Y","N","Y","Y","N"))
How can I use the spread function in tidyr, or some other function, to get the count of each result (Y or N) per name, with Y and N as column headers, like this:
name Y N
a 2 0
b 0 1
Thanks

These are a few ways of many to go about it:
1) With library dplyr, you can simply group things and count into the format needed:
library(dplyr)
other %>% group_by(name) %>% summarise(N = sum(result == 'N'), Y = sum(result == 'Y'))
Source: local data frame [4 x 3]
name N Y
<fctr> <int> <int>
1 a 0 2
2 b 1 0
3 c 0 1
4 d 1 0
2) You can use a combination of table and tidyr spread as follows:
library(tidyr)
spread(as.data.frame(table(other)), result, Freq)
name N Y
1 a 0 2
2 b 1 0
3 c 0 1
4 d 1 0
3) You can use a combination of dplyr and tidyr to do as follows:
library(dplyr)
library(tidyr)
spread(count(other, name, result), result, n, fill = 0)
Source: local data frame [4 x 3]
Groups: name [4]
name N Y
<fctr> <dbl> <dbl>
1 a 0 2
2 b 1 0
3 c 0 1
4 d 1 0

Here is another option using dcast from data.table
library(data.table)
dcast(setDT(other), name~result, length)
# name N Y
#1: a 0 2
#2: b 1 0
#3: c 0 1
#4: d 1 0
Although table(other) would be a compact option (from @mtoto's comments), dcast may be more efficient for large datasets. Some benchmarks are given below:
set.seed(24)
other1 <- data.frame(name = sample(letters, 1e6, replace=TRUE),
result = sample(c("Y", "N"), 1e6, replace=TRUE), stringsAsFactors=FALSE)
other2 <- copy(other1)
gopala1 <- function() other1 %>%
group_by(name) %>%
summarise(N = sum(result == 'N'), Y = sum(result == 'Y'))
gopala2 <- function() spread(as.data.frame(table(other1)), result, Freq)
gopala3 <- function() spread(count(other1, name, result), result, n, fill = 0)
akrun <- function() dcast(as.data.table(other2), name~result, length)
library(microbenchmark)
microbenchmark(gopala1(), gopala2(), gopala3(),
akrun(), unit='relative', times = 20L)
# expr min lq mean median uq max neval
# gopala1() 2.710561 2.331915 2.142183 2.325167 2.134399 1.513725 20
# gopala2() 2.859464 2.564126 2.531130 2.683804 2.720833 1.982760 20
# gopala3() 2.345062 2.076400 1.953136 2.027599 1.882079 1.947759 20
# akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
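For reference, the compact table() route mentioned above can be sketched in base R alone. This is a minimal sketch, assuming the two-column `other` data frame from the question; the names `counts` and `wide` are only illustrative.

```r
# Data from the question
other <- data.frame(name   = c("a", "b", "a", "c", "d"),
                    result = c("Y", "N", "Y", "Y", "N"))

# Cross-tabulate name against result; combinations that never occur count as 0
counts <- table(other$name, other$result)

# Turn the contingency table into a wide data frame with one row per name
wide <- as.data.frame.matrix(counts)
wide <- cbind(name = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

The result has one column per distinct value of result, so it needs no packages, but it has not been timed against the options benchmarked above.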

Related

efficient way to rowwise mutate with sample

For each 0 in x, I want to insert a random number from 1:10. I'm looking for an efficient way to do this in dplyr and/or data.table, as I have a very large dataset (10m rows).
library(tidyverse)
df <- data.frame(x = 1:10)
df[4, 1] = 0
df[6, 1] = 0
df
# x
# 1 1
# 2 2
# 3 3
# 4 0
# 5 5
# 6 0
# 7 7
# 8 8
# 9 9
# 10 10
This doesn't work, as it replaces each zero with the same value:
set.seed(1)
df %>%
mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
# x x2
# 1 1 1
# 2 2 2
# 3 3 3
# 4 0 9
# 5 5 5
# 6 0 9
# 7 7 7
# 8 8 8
# 9 9 9
# 10 10 10
It can be achieved with rowwise, but that is slow on a large dataset:
set.seed(1)
#use rowwise
df %>%
rowwise() %>%
mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
# x x2
# <dbl> <dbl>
# 1 1 1
# 2 2 2
# 3 3 3
# 4 0 9
# 5 5 5
# 6 0 4
# 7 7 7
# 8 8 8
# 9 9 9
# 10 10 10
Any suggestions to speed this up?
Thanks

Not in tidyverse, but you could just do something like this:
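is_zero <- (df$x == 0)
# replace = TRUE is needed once there are more zeros than candidate values
replacements <- sample(1:10, sum(is_zero), replace = TRUE)
df$x[is_zero] <- replacements
Of course, you can collapse that down if you'd like.
df$x[df$x == 0] <- sample(1:10, sum(df$x == 0), replace = TRUE)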
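The same idea can be wrapped in a small function. This is only a sketch: `fill_zeros` and its `pool` argument are hypothetical names, and it assumes sampling with replacement is acceptable, which becomes necessary as soon as there are more zeros than candidate values (as there would be in a 10m-row data set).

```r
# Hypothetical helper: replace every 0 in x with an independent draw from pool
fill_zeros <- function(x, pool = 1:10) {
  idx <- which(x == 0)   # which() drops NA comparisons, so NAs in x stay as they are
  x[idx] <- sample(pool, length(idx), replace = TRUE)
  x
}

set.seed(1)
x <- c(1, 0, 3, rep(0, 20))   # 21 zeros, more than the 10 values in the pool
filled <- fill_zeros(x)
```

All draws are made in a single vectorized call to sample(), which is what avoids the row-by-row cost of rowwise().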

Using the above solutions and microbenchmark and a slight modification to the dataset for setup:
library(data.table)
library(tidyverse)
df <- data.frame(x = 1:100000, y = rbinom(100000, size = 1, 0.5)) %>%
mutate(x = ifelse(y == 0, 0, x)) %>%
dplyr::select(-y)
dt <- setDT(df)
test <- microbenchmark::microbenchmark(
base1 = {
df$x[df$x == 0] <- sample(1:10, sum(df$x == 0), replace = T)
},
dplyr1 = {
df %>%
mutate(x2 = replace(x, which(x == 0), sample(1:10, sum(x == 0), replace = T)))
},
dplyr2 = {
df %>% group_by(id=row_number()) %>%
mutate(across(c(x),.fns = list(x2 = ~ ifelse(.==0, sample(1:10, 1, replace = T), .)) )) %>%
ungroup() %>% select(-id)
},
data.table = {
dt[x == 0, x := sample(1:10, .N, replace = T)]
},
times = 500L
)
test
# Unit: microseconds
# expr min lq mean median uq max neval cld
# base1 733.7 785.9 979.0938 897.25 1137.0 1839.4 500 a
# dplyr1 5207.1 5542.1 6129.2276 5967.85 6476.0 21790.7 500 a
# dplyr2 15963406.4 16156889.2 16367969.8704 16395715.00 16518252.9 19276215.5 500 b
# data.table 1547.4 2229.3 2422.1278 2455.60 2573.7 15076.0 500 a
I thought data.table would be quickest, but the base solution seems best (assuming I've set up the microbenchmark correctly?).
EDIT based on @chinsoon12's comment
1e5 rows:
Unit: microseconds
expr min lq mean median uq max neval cld
base1 730.4 839.30 1380.465 1238.00 1322.85 28977.3 500 a
data.table 1394.8 1831.85 2030.215 1946.95 2060.40 29821.9 500 b
1e6 rows:
Unit: milliseconds
expr min lq mean median uq max neval cld
base1 9.8703 11.6596 16.030715 11.76195 12.04145 326.0118 500 b
data.table 2.3772 2.7939 3.855672 3.04700 3.25900 61.4083 500 a
At 1e6 rows, data.table is the quickest.
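One thing to watch when timing in-place replacements like these: an expression that assigns back into df, or a data.table := that updates by reference, may leave no zeros for later iterations to replace. A minimal base R sketch of one way around this, assuming it is acceptable to time a function that returns a new vector; `fill_base` and `x0` are hypothetical names.

```r
# Hypothetical helper: returns a new vector and leaves its input untouched
fill_base <- function(x) {
  idx <- which(x == 0)
  x[idx] <- sample(1:10, length(idx), replace = TRUE)
  x
}

set.seed(42)
x0 <- 1:1e5
x0[rbinom(1e5, size = 1, prob = 0.5) == 0] <- 0L   # roughly half the values become 0

zeros_before <- sum(x0 == 0)
res <- fill_base(x0)
zeros_after <- sum(x0 == 0)   # same as zeros_before: x0 itself was not modified

# Every timed repetition now sees the same number of zeros, e.g.:
# system.time(for (i in 1:100) fill_base(x0))
```

For a data.table version, working on data.table::copy() of the input inside the timed expression would serve the same purpose, at the cost of also timing the copy.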

Maybe try with across() from dplyr in this way:
library(tidyverse)
#Data
df <- data.frame(x = 1:10)
df[4, 1] = 0
df[6, 1] = 0
#Code
df %>% group_by(id=row_number()) %>%
mutate(across(c(x),.fns = list(x2 = ~ ifelse(.==0, sample(1:10, 1), .)) )) %>%
ungroup() %>% select(-id)
Output:
# A tibble: 10 x 2
x x_x2
<dbl> <dbl>
1 1 1
2 2 2
3 3 3
4 0 5
5 5 5
6 0 6
7 7 7
8 8 8
9 9 9
10 10 10

I am adding a separate answer because the base option I provided already has votes. Here is a dplyr way using replace.
library(dplyr)
df %>%
mutate(x2 = replace(x, which(x == 0), sample(1:10, sum(x == 0), replace = TRUE)))

Here is a data.table option using similar logic to Adam's answer. This filters for rows that meet your criteria: x == 0, and then samples 1:10 .N times (which, without a grouping variable, is the number of rows of the filtered data.table).
library(data.table)
set.seed(1)
setDT(df)[x == 0, x := sample(1:10, .N, replace = TRUE)]
df
x
1: 1
2: 2
3: 3
4: 9
5: 5
6: 4
7: 7
8: 8
9: 9
10: 10

Row Minimum except certain columns

I have the data frame below. I need to find the row-wise min and max, ignoring the few columns that are character.
df
x y z
1 1 1 a
2 2 5 b
3 7 4 c
I need
df
x y z Min Max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7

Another dplyr possibility could be:
df %>%
mutate(Max = do.call(pmax, select_if(., is.numeric)),
Min = do.call(pmin, select_if(., is.numeric)))
x y z Max Min
1 1 1 a 1 1
2 2 5 b 5 2
3 7 4 c 7 4
Or a variation proposed by @G. Grothendieck:
df %>%
mutate(Min = pmin(!!!select_if(., is.numeric)),
Max = pmax(!!!select_if(., is.numeric)))

Another base R solution. Subset only the columns with numbers and then use apply in each row to get the minimum and maximum value with range.
cbind(df, t(apply(df[sapply(df, is.numeric)], 1, function(x)
setNames(range(x, na.rm = TRUE), c("min", "max")))))
# x y z min max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7

1) This one-liner uses no packages:
transform(df, min = pmin(x, y), max = pmax(x, y))
giving:
x y z min max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
2) If you have many columns and don't want to list them all or determine yourself which are numeric then this also uses no packages.
ix <- sapply(df, is.numeric)
transform(df, min = apply(df[ix], 1, min), max = apply(df[ix], 1, max))
If your actual data has NAs and if you want to ignore them when taking the min or max then min, max, pmin and pmax all take an optional na.rm = TRUE argument.
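To illustrate the na.rm point, here is a small sketch on a variant of the example data with one missing value. The NA is an assumption added for illustration, and pmin/pmax are applied to all numeric columns through do.call.

```r
# Example data with one NA inserted for illustration
df <- data.frame(x = c(1, NA, 7), y = c(1, 5, 4), z = c("a", "b", "c"))

ix <- sapply(df, is.numeric)   # logical mask of the numeric columns

# Without na.rm, any row containing an NA yields NA
min_default <- do.call(pmin, df[ix])

# With na.rm = TRUE, the NA is ignored within each row
df$min <- do.call(pmin, c(df[ix], na.rm = TRUE))
df$max <- do.call(pmax, c(df[ix], na.rm = TRUE))
df
```

Note that `ix` is computed before the new columns are added, so min and max are not fed back into the calculation.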
Note
Lines <- "x y z
1 1 1 a
2 2 5 b
3 7 4 c"
df <- read.table(text = Lines)

1) We can use select_if to select the numeric columns, then get the row-wise min and max with pmin and pmax, and bind the result to the original dataset
library(dplyr)
library(purrr)
df %>%
select_if(is.numeric) %>%
transmute(Min = reduce(., pmin, na.rm = TRUE),
Max = reduce(., pmax, na.rm = TRUE)) %>%
bind_cols(df, .)
# x y z Min Max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
NOTE: Here, we use only a single expression of select_if
2) The same can be done in base R (no packages used)
i1 <- names(which(sapply(df, is.numeric)))
df['Min'] <- do.call(pmin, c(df[i1], na.rm = TRUE))
df['Max'] <- do.call(pmax, c(df[i1], na.rm = TRUE))
Also, as stated in the comments, this is a generalized option. If there are only two columns, pmin(x, y) or pmax(x, y) is enough, but that does not check whether the columns are numeric and is not a general solution.
NOTE: All of the solutions mentioned here are either answered first or from the comments with the OP
data
df <- structure(list(x = c(1L, 2L, 7L), y = c(1L, 5L, 4L), z = c("a",
"b", "c")), class = "data.frame", row.names = c("1", "2", "3"
))

Removing groups from dataframe if variable has repeated values

I would like to ask whether there is a way of removing a group from a data frame using dplyr (or any other way, for that matter) in the following manner. Let's say I have a data frame of the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups in which Variable 2 has two consecutive identical values. In the table above, that would remove group 2, because its values are a, a, b, but not group 3, where they are a, c, a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...

To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
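The lag comparison can also be sketched in base R by comparing each value with its predecessor inside every group. This is a minimal sketch on the example data; `has_run`, `bad` and `result` are hypothetical names.

```r
df <- data.frame(variable1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 variable2 = c("a", "b", "a", "a", "b", "a", "c", "a"))

# TRUE if any element equals the one right before it
has_run <- function(v) any(v[-1] == v[-length(v)])

# One logical per group, named by the group value
bad <- tapply(df$variable2, df$variable1, has_run)

# Keep only the rows whose group has no consecutive duplicates
result <- df[!bad[as.character(df$variable1)], ]
result
```

Dropping the first element and the last element gives two aligned vectors of neighbours, which plays the same role as lag() without producing a leading NA.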

prepare data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
write functions to test if consecutive repetitions are there or not:
any.consecutive.p <- function(v) {
for (i in seq_len(length(v) - 1)) {
if (v[i] == v[i + 1]) {
return(TRUE)
}
}
return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
any.consecutive.p(df[, col])
}
any.consecutive.p returns TRUE as soon as it finds a consecutive repetition in a vector (v), and FALSE otherwise.
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
split data frame by values of Variable.1
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally go over this data.frame list and test for each data frame, if it contains consecutive duplicates in Variable.2 column.
If found, don't collect it.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
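A more compact variant of the same split-and-test approach can be sketched with rle(), which reports the length of every run of equal values. A group has consecutive duplicates exactly when some run is longer than 1. The helper name `no_runs` is hypothetical.

```r
df <- data.frame(Variable.1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 Variable.2 = unlist(strsplit("abaabaca", "")))

# TRUE when every run of equal values in Variable.2 has length 1
no_runs <- function(d) all(rle(as.character(d$Variable.2))$lengths == 1)

# Split by group, keep the groups that pass, and bind them back together
kept <- do.call(rbind, Filter(no_runs, split(df, df$Variable.1)))
kept
```

Filter() drops the failing groups before binding, so no NULL entries have to be handled.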

Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
Benchmark shows data.table is faster
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
microbenchmark(use_dplyr(), use_DT())

Counting amount of zeros within a "melted" data frame

Hi, I am learning R and trying to count how many zeros I have within the melted data. I want to know how many zeros correspond to each of the columns b and c, and print the two results.
I generated an example:
library(reshape)
library(plyr)
library(dplyr)
id = c(1,2,3,4,5,6,7,8,9,10)
b = c(0,0,5,6,3,7,2,8,1,8)
c = c(0,4,9,87,0,87,0,4,5,0)
test = data.frame(id,b,c)
test_melt = melt(test, id.vars = "id")
test_melt
I imagine I should create an if statement for that, something like
if (test$value == 0){print()}, but how can I tell R to count zeros for columns that have been melted?

With your data:
test_melt %>%
group_by(variable) %>%
summarize(zeroes = sum(value == 0))
# # A tibble: 2 x 2
# variable zeroes
# <fctr> <int>
# 1 b 2
# 2 c 4
Base R:
aggregate(test_melt$value, by = list(variable = test_melt$variable),
FUN = function(x) sum(x == 0))
# variable x
# 1 b 2
# 2 c 4
... and for curiosity:
library(microbenchmark)
microbenchmark(
dplyr = group_by(test_melt, variable) %>% summarize(zeroes = sum(value == 0)),
base1 = aggregate(test_melt$value, by = list(variable = test_melt$variable), FUN = function(x) sum(x == 0)),
# @PankajKaundal's suggested "formula" notation reads easier
base2 = aggregate(value ~ variable, test_melt, function(x) sum(x == 0))
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 916.421 986.985 1069.7000 1022.1760 1094.7460 2272.636 100
# base1 647.658 682.302 783.2065 715.3045 765.9940 1905.411 100
# base2 813.219 867.737 950.3247 897.0930 959.8175 2017.001 100

sum(test_melt$value==0)
This should do it.
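Note that this one-liner gives the total across both variables. A base R sketch of the per-variable count follows. The melted frame is rebuilt by hand here so that the example needs no packages, which is an assumption about what melt() produces for this data (columns id, variable, value).

```r
test <- data.frame(id = 1:10,
                   b  = c(0, 0, 5, 6, 3, 7, 2, 8, 1, 8),
                   c  = c(0, 4, 9, 87, 0, 87, 0, 4, 5, 0))

# Hand-built stand-in for melt(test, id.vars = "id")
test_melt <- data.frame(id       = rep(test$id, 2),
                        variable = rep(c("b", "c"), each = 10),
                        value    = c(test$b, test$c))

total   <- sum(test_melt$value == 0)                             # zeros overall
per_var <- tapply(test_melt$value == 0, test_melt$variable, sum) # zeros per variable

# The same per-column counts straight from the wide data, without melting
wide_counts <- colSums(test[c("b", "c")] == 0)
```

Here `total` is 6, while `per_var` and `wide_counts` both give 2 for b and 4 for c.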

This might help. Is this what you're looking for?
> test_melt[4] <- 1
> test_melt2 <- aggregate(V4 ~ value + variable, test_melt, sum)
> test_melt2
value variable V4
1 0 b 2
2 1 b 1
3 2 b 1
4 3 b 1
5 5 b 1
6 6 b 1
7 7 b 1
8 8 b 2
9 0 c 4
10 4 c 2
11 5 c 1
12 9 c 1
13 87 c 2
V4 is the count

dplyr: max value in a group, excluding the value in each row?

I have a data frame that looks as follows:
> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]
g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4
I know how to add a column with the maximum x value for each group g:
> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g
g x x_max
1 A 7 7
2 A 3 7
3 B 5 9
4 B 9 9
5 B 2 9
6 C 4 4
But what I would like to get is the maximum x value for each group g, excluding the x value in each row.
For the given example, the desired output would look like this:
Source: local data frame [6 x 3]
Groups: g
g x x_max x_max_exclude
1 A 7 7 3
2 A 3 7 7
3 B 5 9 9
4 B 9 9 5
5 B 2 9 9
6 C 4 4 NA
I thought I might be able to use row_number() to remove particular elements and take the max of what remained, but hit warning messages and got incorrect -Inf output:
> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g
g x x_max r x_max_exclude
1 A 7 7 1 -Inf
2 A 3 7 2 -Inf
3 B 5 9 1 -Inf
4 B 9 9 2 -Inf
5 B 2 9 3 -Inf
6 C 4 4 1 -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number() doesn't work would also be much appreciated. Thanks for the help.
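On why the row_number() attempt returns -Inf: inside a grouped mutate, row_number() yields the whole vector of positions for the group rather than a single index, so x[-r] drops every element and max() is left with nothing. A base R sketch of the effect for group B follows; treating `r` as seq_along(x) is an assumption about what row_number() produces here.

```r
x <- c(5, 9, 2)    # the x values of group B
r <- seq_along(x)  # stands in for row_number() within the group: 1 2 3

remaining <- x[-r] # removes positions 1, 2 and 3 at once, leaving nothing
length(remaining)

# Excluding one position at a time gives the intended result
per_row <- vapply(seq_along(x), function(i) max(x[-i]), numeric(1))
per_row
```

Calling max() on the empty `remaining` vector is what triggers the "no non-missing arguments to max" warning and the -Inf values.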

You could try:
df %>%
group_by(g) %>%
arrange(desc(x)) %>%
mutate(max = ifelse(x == max(x), x[2], max(x)))
Which gives:
#Source: local data frame [6 x 3]
#Groups: g
#
# g x max
#1 A 7 3
#2 A 3 7
#3 B 9 5
#4 B 5 9
#5 B 2 9
#6 C 4 NA
Benchmark
I've tried the solutions so far on the benchmark:
df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
x = sample(1:10, 10e5, replace = TRUE))
library(microbenchmark)
mbm <- microbenchmark(
steven = df %>%
group_by(g) %>%
arrange(desc(x)) %>%
mutate(max = ifelse(x == max(x), x[2], max(x))),
eric = df %>%
group_by(g) %>%
mutate(x_max = max(x),
x_max2 = sort(x, decreasing = TRUE)[2],
x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
select(-x_max2),
arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
times = 50
)
#Arun's data.table solution is the fastest:
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448 50 b
# eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170 50 c
# arun 44.48639 46.17961 54.65824 47.74142 48.9884 102.3830 50 a

Interesting problem. Here's one way using data.table:
require(data.table)
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
The idea is to order by column x and then group by g on those ordered indices. Because each group is now sorted in ascending order, the maximum excluding the current row is the value at .N for the first .N-1 rows, and the value at row .N-1 for the .Nth row.
.N is a special variable that holds the number of observations in each group.
I'll leave it to you and/or the dplyr experts to translate this (or answer with another approach).
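The ordering idea can also be sketched in base R with ave(). This is an illustrative translation, not the data.table code itself. The function name `max_excl_sorted` is hypothetical, and returning NA for single-row groups is an assumption taken from the desired output in the question.

```r
# For each element, the max of the other elements in the same vector
max_excl_sorted <- function(x) {
  n <- length(x)
  if (n < 2) return(rep(NA_real_, n))  # nothing left once the only row is excluded
  o <- order(x)                        # positions from smallest to largest
  out <- numeric(n)
  out[o[-n]] <- x[o[n]]                # every row but the largest sees the largest
  out[o[n]]  <- x[o[n - 1]]            # the largest row sees the second largest
  out
}

df <- data.frame(g = c("A", "A", "B", "B", "B", "C"),
                 x = c(7, 3, 5, 9, 2, 4))
df$x_max_exclude <- ave(df$x, df$g, FUN = max_excl_sorted)
df
```

Only one sort per group is needed, instead of recomputing a maximum for every row.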

This is the best I've come up with so far. Not sure if there's a better way.
df %>%
group_by(g) %>%
mutate(x_max = max(x),
x_max2 = sort(x, decreasing = TRUE)[2],
x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
select(-x_max2)

Another way with a functional:
df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x))
Source: local data frame [6 x 3]
Groups: g
g x x_max_exclude
1 A 7 3
2 A 3 7
3 B 5 9
4 B 9 5
5 B 2 9
6 C 4 NA
We write a function called max_exclude that does the operation that you describe.
max_exclude <- function(v) {
res <- c()
for(i in seq_along(v)) {
res[i] <- suppressWarnings(max(v[-i]))
}
res <- ifelse(!is.finite(res), NA, res)
as.numeric(res)
}
It works with base R too:
df$x_max_exclude <- with(df, ave(x, g, FUN=max_exclude))
Source: local data frame [6 x 3]
g x x_max_exclude
1 A 7 3
2 A 3 7
3 B 5 9
4 B 9 5
5 B 2 9
6 C 4 NA
Benchmark
Here's a lesson kids, beware of for loops!
big.df <- data.frame(g=rep(LETTERS[1:4], each=1e3), x=sample(10, 4e3, replace=T))
microbenchmark(
plafort_dplyr = big.df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x)),
plafort_ave = big.df$x_max_exclude <- with(big.df, ave(x, g, FUN=max_exclude)),
StevenB = (big.df %>%
group_by(g) %>%
mutate(max = ifelse(row_number(desc(x)) == 1, x[row_number(desc(x)) == 2], max(x)))
),
Eric = df %>%
group_by(g) %>%
mutate(x_max = max(x),
x_max2 = sort(x, decreasing = TRUE)[2],
x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
select(-x_max2),
Arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
)
Unit: milliseconds
expr min lq mean median uq max neval
plafort_dplyr 75.219042 85.207442 89.247409 88.203225 90.627663 179.553166 100
plafort_ave 75.907798 84.604180 87.136122 86.961251 89.431884 104.884294 100
StevenB 4.436973 4.699226 5.207548 4.931484 5.364242 11.893306 100
Eric 7.233057 8.034092 8.921904 8.414720 9.060488 15.946281 100
Arun 1.789097 2.037235 2.410915 2.226988 2.423638 9.326272 100