unique() but only on consecutive rows - r

I am looking for an equivalent of unique(), but applied only to consecutive rows. I.e., in the following example:
df <- data.frame(a = rep(c(1:3,1:3), each = 3), b = rep(c(4:6,4:6), each = 3))
unique(df)
# a b
#1 1 4
#4 2 5
#7 3 6
I want to actually get:
function_I_am_looking_for(df)
# a b
#1 1 4
#4 2 5
#7 3 6
#10 1 4
#13 2 5
#16 3 6

We can use rleid from data.table to create a grouping variable and slice the first row of each group:
library(dplyr)
library(data.table)
df %>%
  group_by(grp = rleid(a, b)) %>%
  slice(1) %>%
  ungroup %>%
  select(-grp)
# A tibble: 6 x 2
# a b
# <int> <int>
#1 1 4
#2 2 5
#3 3 6
#4 1 4
#5 2 5
#6 3 6
Or the same with data.table syntax: grouped by the rleid of 'a' and 'b', extract the row index of each group's first element (.I[1]) and subset the rows with that:
setDT(df)[df[, .I[1], .(rleid(a, b))]$V1]
Or using unique with by
unique(setDT(df)[, grp := rleid(a, b)], by = "grp")
Or, the OP's preferred version, a solution for a general data.frame that combines base unique with data.table's rleidv:
unique(cbind(rleidv(df), df))[,-1]
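For completeness, here is a sketch of a purely base R alternative (not from the original answers): keep each row that differs from the row immediately above it. It assumes df contains no NAs.
keep <- c(TRUE, rowSums(df[-1, , drop = FALSE] != df[-nrow(df), , drop = FALSE]) > 0)
df[keep, ]
# returns rows 1, 4, 7, 10, 13, 16 of df, matching the desired output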

Apply function to a row in a data.frame using dplyr

In base R I would do the following:
d <- data.frame(a = 1:4, b = 4:1, c = 2:5)
apply(d, 1, which.max)
With dplyr I could do the following:
library(dplyr)
d %>% mutate(u = purrr::pmap_int(list(a, b, c), function(...) which.max(c(...))))
If there's another column in d I need to specify it, but I want this to work with an arbitrary number of columns.
Conceptually, I’d like something like
pmap_int(list(everything()), ...)
pmap_int(list(.), ...)
But this obviously does not work. How would I solve that canonically with dplyr?
We just need the data to be specified as ., since a data.frame is a list with columns as list elements. If we wrap it in list(.), it becomes a nested list:
library(dplyr)
d %>%
  mutate(u = pmap_int(., ~ which.max(c(...))))
# a b c u
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3
Or we can use cur_data() (superseded by pick() in dplyr >= 1.1.0):
d %>%
  mutate(u = pmap_int(cur_data(), ~ which.max(c(...))))
Or, if we want to use everything(), place it inside select(), since list(everything()) on its own doesn't specify the data from which everything should be selected:
d %>%
  mutate(u = pmap_int(select(., everything()), ~ which.max(c(...))))
Or using rowwise
d %>%
  rowwise %>%
  mutate(u = which.max(cur_data())) %>%
  ungroup
# A tibble: 4 x 4
# a b c u
# <int> <int> <int> <int>
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3
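On dplyr >= 1.0.0 there is also c_across(), which directly addresses the everything() part of the question; a minimal sketch, not from the original answers:
library(dplyr)
d %>%
  rowwise() %>%
  mutate(u = which.max(c_across(everything()))) %>%
  ungroup()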
Or, more efficiently, with max.col:
max.col(d, 'first')
#[1] 2 2 3 3
Or with collapse
library(collapse)
dapply(d, which.max, MARGIN = 1)
#[1] 2 2 3 3
which can be included in dplyr as
d %>%
  mutate(u = max.col(cur_data(), 'first'))
Here are some data.table options
setDT(d)[, u := which.max(unlist(.SD)), 1:nrow(d)]
or
setDT(d)[, u := max.col(.SD, "first")]

What is the best way to apply a function to a range of values from another column in R data.frame so it remains vectorized?

I have several columns in an R data.frame, and I want to create a new column based on ranges of values from an already existing column. Those ranges are not regular and are determined by the start and end values in the first two columns. I want the calculation to remain vectorized; I don't want a for loop underneath.
Required result, achieved with a for loop:
df = data.frame(start=c(2,1,4,4,1), end=c(3,3,5,4,2), values=c(1:5))
for (i in 1:nrow(df)) {
  df[i, 'new'] <- sum(df[df[i, 'start']:df[i, 'end'], 'values'])
}
df
Here is a base R one-liner.
mapply(function(x1, x2, y){sum(y[x1:x2])}, df[['start']], df[['end']], MoreArgs = list(y = df[['values']]))
#[1] 5 6 9 4 3
And another one.
sapply(seq_len(nrow(df)), function(i) sum(df[['values']][df[i, 'start']:df[i, 'end']]))
#[1] 5 6 9 4 3
Here is an option with map2:
library(purrr)
library(dplyr)
df %>%
  mutate(new = map2_dbl(start, end, ~ sum(values[.x:.y])))
Output:
# start end values new
#1 2 3 1 5
#2 1 3 2 6
#3 4 5 3 9
#4 4 4 4 4
#5 1 2 5 3
Or with rowwise
df %>%
  rowwise %>%
  mutate(new = sum(.$values[start:end])) %>%
  ungroup
Output:
# A tibble: 5 x 4
# start end values new
# <dbl> <dbl> <int> <int>
#1 2 3 1 5
#2 1 3 2 6
#3 4 5 3 9
#4 4 4 4 4
#5 1 2 5 3
Or using data.table
library(data.table)
setDT(df)[, new := sum(df$values[start:end]), seq_len(nrow(df))]
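Since the question asks for the calculation to remain vectorized, a fully vectorized base R sketch is also possible with cumulative sums, assuming start and end are valid row indices with start <= end: sum(values[start:end]) equals cs[end] - cs[start - 1], where cs is the cumulative sum and cs[0] is taken as 0.
cs <- cumsum(df$values)
df$new <- cs[df$end] - c(0, cs)[df$start]
df$new
#[1] 5 6 9 4 3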

How to reduce factor levels depending on other attribute?

I have a dataframe of two columns, id and result, and I want to assign factor levels to result depending on id, so that for id "1", result c("a","b","c","d") will have factor levels 1,2,3,4.
For id "2", result c("22","23","24") will have factor levels 1,2,3.
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
I tried to group them with split, but that returns a list instead of a data frame, which causes a length problem for modeling. Can you help please?
Though the question was closed as a duplicate by user @Ronak Shah, I don't believe it is the same question.
After numbering the rows by group, the new column must be coerced to class "factor".
library(dplyr)
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
df <- data.frame(id, result)
df %>%
  group_by(id) %>%
  mutate(fac = row_number()) %>%
  ungroup() %>%
  mutate(fac = factor(fac))
# A tibble: 7 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 23 2
#7 2 24 3
Edit.
If there are repeated values in result, coerce with factor and then as.integer to get the numbers, then coerce those numbers to factor.
id2 <- c(1,1,1,1,2,2,2,2)
result2 <- c("a","b","c","d","22", "22","23","24")
df2 <- data.frame(id = id2, result = result2)
df2 %>%
  group_by(id) %>%
  mutate(fac = as.integer(factor(result))) %>%
  ungroup() %>%
  mutate(fac = factor(fac))
# A tibble: 8 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3
After grouping by id, we can use match with unique to assign a unique number to each result. Using @Rui Barradas's dataframe df2:
library(dplyr)
df2 %>%
  group_by(id) %>%
  mutate(ans = match(result, unique(result))) %>%
  ungroup %>%
  mutate(ans = factor(ans))
# id result ans
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3
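The same match/unique idea works in base R via ave(); a minimal sketch, again using df2:
df2$fac <- factor(ave(seq_along(df2$result), df2$id,
                      FUN = function(i) match(df2$result[i], unique(df2$result[i]))))
df2$fac
#[1] 1 2 3 4 1 1 2 3
#Levels: 1 2 3 4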

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr select, similar to dropping a single column?
df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list('bob', 'cratchit', 'rules!', 'and', 'tiny tim too')
)
df %>%
  select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a workable workaround, but I was wanting to know if it can be done with select_if.
df %>%
  select(-which(map(df, class) == 'list'))
Use Negate
df %>%
  select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate that would give the same result.
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
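On dplyr >= 1.0.0, where select_if() is superseded, the same can be written with select() and the where() predicate helper; a minimal sketch:
library(dplyr)
df %>%
  select(!where(is.list))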

Determine subgroup index

I have a large data frame with groups and subgroups. I would like to determine the index of the subgroup in each group, like shown in the OUTPUT column of the following data frame:
df <- data.frame(
  Group = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
  OUTPUT = c(1,1,2,2,2,1,1,2,2)
)
I've tried several possibilities without any success. I'd like to work with dplyr, but I'm not sure how to go about this. The following code returns an unexpected result.
require(dplyr)
df <- df %>%
  group_by(Group) %>%
  mutate(OUTPUT_2 = dplyr::id(Subgroup))
#df
# Group Subgroup OUTPUT_2
# (fctr) (fctr) (int)
#1 A a 8
#2 A a 8
#3 A b 8
#4 A b 8
#5 A b 8
#6 B a 4
#7 B a 4
#8 B b 4
#9 B b 4
I have the feeling I'm close but am not getting there. Can anybody help?
Here is a solution with data.table, without any aggregation (assuming df has been converted with dt <- as.data.table(df)):
dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)), by = .(Group)]
This will be much faster compared to methods based on aggregation.
We can use the factor route with dplyr
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(OUTPUT = as.numeric(factor(Subgroup, levels = unique(Subgroup))))
# Group Subgroup OUTPUT
# <fctr> <fctr> <dbl>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
Or another option is match with the unique elements of 'Subgroup' after grouping by 'Group'
df %>%
  group_by(Group) %>%
  mutate(OUTPUT = match(Subgroup, unique(Subgroup)))
# Group Subgroup OUTPUT
# <fctr> <fctr> <int>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
# Group Subgroup idx OUTPUT
#1: A a 1 1
#2: A a 1 1
#3: A b 2 2
#4: A b 2 2
#5: A b 2 2
#6: B a 1 1
#7: B a 1 1
#8: B b 2 2
#9: B b 2 2
Translation to dplyr should be straightforward.
Another method, following the idea of using factors from aosmith's comment, is:
dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]
This will create a factor with the correct levels per Group, which is the indexing you're after.
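If, as in the example, the subgroups are consecutive within each group, dplyr >= 1.1.0 also offers consecutive_id(), the dplyr analogue of data.table's rleid; a minimal sketch:
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(idx = consecutive_id(Subgroup)) %>%
  ungroup()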
