how can I mutate in dplyr without losing order? - r

Using data.table I can do the following:
library(data.table)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))
# a b
#1: 1 1
#2: 2 2
#3: 1 NA
#4: 2 NA
dt[, b := b[1], by = a]
# a b
#1: 1 1
#2: 2 2
#3: 1 1
#4: 2 2
Attempting the same operation in dplyr however the data gets scrambled/sorted by a:
library(dplyr)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))
dt %.% group_by(a) %.% mutate(b = b[1])
# a b
#1 1 1
#2 1 1
#3 2 2
#4 2 2
(as an aside the above also sorts the original dt, which is somewhat confusing for me given dplyr's philosophy of not modifying in place - I'm guessing that's a bug with how dplyr interfaces with data.table)
What's the dplyr way of achieving the above?

In the current development version of dplyr (which will eventually
become dplyr 0.2) the behaviour differs between data frames and data
tables:
library(dplyr)
library(data.table)
df <- data.frame(a = 1:2, b = c(1,2,NA,NA))
dt <- data.table(df)
df %.% group_by(a) %.% mutate(b = b[1])
## Source: local data frame [4 x 2]
## Groups: a
##
## a b
## 1 1 1
## 2 2 2
## 3 1 1
## 4 2 2
dt %.% group_by(a) %.% mutate(b = b[1])
## Source: local data table [4 x 2]
## Groups: a
##
## a b
## 1 1 1
## 2 1 1
## 3 2 2
## 4 2 2
This happens because group_by() applied to a data.table
automatically does setkey() on the assumption that the index will make
future operations faster.
If there's a strong feeling that this is a bad default, I'm happy to change it.

Related

Apply function to a row in a data.frame using dplyr

In base R I would do the following:
d <- data.frame(a = 1:4, b = 4:1, c = 2:5)
apply(d, 1, which.max)
With dplyr I could do the following:
library(dplyr)
d %>% mutate(u = purrr::pmap_int(list(a, b, c), function(...) which.max(c(...))))
If there’s another column in d I need to specify it, but I want this to work w/ an arbitrary amount if columns.
Conceptually, I’d like something like
pmap_int(list(everything()), ...)
pmap_int(list(.), ...)
But this does obviously not work. How would I solve that canonically with dplyr?
We just need the data to be specified as . as data.frame is a list with columns as list elements. If we wrap list(.), it becomes a nested list
library(dplyr)
d %>%
mutate(u = pmap_int(., ~ which.max(c(...))))
# a b c u
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3
Or can use cur_data()
d %>%
mutate(u = pmap_int(cur_data(), ~ which.max(c(...))))
Or if we want to use everything(), place that inside select as list(everything()) doesn't address the data from which everything should be selected
d %>%
mutate(u = pmap_int(select(., everything()), ~ which.max(c(...))))
Or using rowwise
d %>%
rowwise %>%
mutate(u = which.max(cur_data())) %>%
ungroup
# A tibble: 4 x 4
# a b c u
# <int> <int> <int> <int>
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3
Or this is more efficient with max.col
max.col(d, 'first')
#[1] 2 2 3 3
Or with collapse
library(collapse)
dapply(d, which.max, MARGIN = 1)
#[1] 2 2 3 3
which can be included in dplyr as
d %>%
mutate(u = max.col(cur_data(), 'first'))
Here are some data.table options
setDT(d)[, u := which.max(unlist(.SD)), 1:nrow(d)]
or
setDT(d)[, u := max.col(.SD, "first")]

count by all variables / count distinct with dplyr

Say I have this data.frame :
library(dplyr)
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
# x y
# 1 a a
# 2 b b
# 3 b b
# 4 c c
# 5 c c
# 6 c c
I can group and count easily by mentioning the names :
df1 %>%
count(x,y)
# A tibble: 3 x 3
# x y n
# <fctr> <fctr> <int>
# 1 a a 1
# 2 b b 2
# 3 c c 3
How do I do to group by everything without mentioning individual column names, in the most compact /readable way ?
We can pass the input itself to the ... argument and splice it with !!! :
df1 %>% count(., !!!.)
#> x y n
#> 1 a a 1
#> 2 b b 2
#> 3 c c 3
Note : see edit history to make sense of some comments
With base we could do : aggregate(setNames(df1[1],"n"), df1, length)
For those who wouldn't get the voodoo you are using in the accepted answer, if you don't need to use dplyr, you can do it with data.table:
setDT(df1)
df1[, .N, names(df1)]
# x y N
# 1: a a 1
# 2: b b 2
# 3: c c 3
Have you considered the (now superceded) group_by_all()?
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
df1 %>% group_by_all() %>% count
df1 %>% group_by(across()) %>% count()
df1 %>% count(across()) # don't know why this returns a data.frame and not tibble
See the colwise vignette "other verbs" section for explanation... though honestly I get turned around myself sometimes.

Fill sequence by factor

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1

dplyr: access current group variable

After using data.table for quite some time I now thought it's time to try dplyr. It's fun, but I wasn't able to figure out how to access
the current grouping variable
returning multiple values per group
The following example shows is working fine with data.table. How would you write this with dplyr
library(data.table)
foo <- matrix(c(1, 2, 3, 4), ncol = 2)
dt <- data.table(a = c(1, 1, 2), b = c(4, 5, 6))
# data.table (expected)
dt[, .(c = foo[, a]), by = a]
a c
1: 1 1
2: 1 2
3: 2 3
4: 2 4
# dplyr (?)
library(dplyr)
dt %>%
group_by(a) %>%
summarize(c = foo[a])
We can use do from dplyr. (No other packages used). The do is very handy for expanding rows. We only need to wrap with data.frame.
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, unique(.$a)]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or instead of unique we can subset by the 1st observation
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, .$a[1]]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or with dplyr >= 1.0.0 (EDIT: Based on #Todd West comments)
dt %>%
reframe(c = foo[, cur_group()$a], .by = 'a')
a c
1 1 1
2 1 2
3 2 3
4 2 4
This can be also done without using any packages
stack(lapply(split(dt$a, dt$a), function(x) foo[,unique(x)]))[2:1]
# ind values
#1 1 1
#2 1 2
#3 2 3
#4 2 4
You can still access the group variable but it is like a normal vector with one unique value for each group, so if you put unique around it, it will work. And at same time, dplyr does not seem to expand rows like data.table automatically, you will need the unnest from tidyr package:
library(dplyr); library(tidyr)
dt %>%
group_by(a) %>%
summarize(c = list(foo[,unique(a)])) %>%
unnest()
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
Or we can use first to speed up, since we've already know the group variable vector is the same for every group:
dt %>%
group_by(a) %>%
summarize(c = list(foo[,first(a)])) %>%
unnest()
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
To access a grouping variable in a grouped operation (map, walk, mutate), we can refer to .y which is exposed automatically within the evaluation context.
Example
> iris %>% group_by(Species) %>% group_walk(~{ print(.y) })
# A tibble: 1 x 1
Species
<fct>
1 setosa
# A tibble: 1 x 1
Species
<fct>
1 versicolor
# A tibble: 1 x 1
Species
<fct>
1 virginica
This is also documented with more details in https://dplyr.tidyverse.org/reference/group_map.html
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
Regarding the other proposed solutions: Afaik do is not recommended any longer, and the other solution with unqiue is imho clumsy (as it requires another reference to the dataframe in question).

combine different row's values in table in r

I need to re-format a table in R.
I have a table like this.
ID category
1 a
1 b
2 c
3 d
4 a
4 c
5 a
And I want to reform it as
ID category1 category2
1 a b
2 c null
3 d null
4 a c
5 a null
Is this doable in R?
This is a very straightforward "long to wide" type of reshaping problem, but you need a secondary "id" (or "time") variable.
You can try using getanID from my "splitstackshape" package and use dcast to reshape from long to wide. getanID will create a new column called ".id" that would be used as your "time" variable:
library(splitstackshape)
dcast.data.table(getanID(mydf, "ID"), ID ~ .id, value.var = "category")
# ID 1 2
# 1: 1 a b
# 2: 2 c NA
# 3: 3 d NA
# 4: 4 a c
# 5: 5 a NA
Same as Ananda's, but using dplyr and tidyr:
library(tidyr)
library(dplyr)
mydf %>% group_by(ID) %>%
mutate(cat_row = paste0("category", 1:n())) %>%
spread(key = cat_row, value = category)
# Source: local data frame [5 x 3]
#
# ID category1 category2
# 1 1 a b
# 2 2 c NA
# 3 3 d NA
# 4 4 a c
# 5 5 a NA

Resources