Grouped non-dense rank without omitted values - r

I have the following data.frame:
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
And I want to add a new column grp which, for each date, ranks the IDs. Ties should have the same value, but there should be no omitted values. That is, if there are two values which are equally minimum, they should both get rank 1, and the next lowest values should get rank 2.
The expected result would therefore look like this. Note that, as mentioned, the groups are for each date, so the operation must be grouped by date.
data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1),
grp = c(2, 2, 1, 2, 1, 2, 3, 1, 2, 2, 1, 1))
I'm sure there's a trivial way to do this but I haven't found it: none of the options for ties.method behave in this way (data.table::frank also doesn't help, since it only adds a dense rank).
I thought of doing a normal rank and then using data.table::rleid, but that doesn't work if there are duplicate values separated by other values during the same day.
I also thought of grouping by date and id and then using a group-ID, but the lowest values each day must start at rank 1, so that won't work either.
The only functional solution I've found is to create another table with the unique ids per day and then join that table to this one:
suppressPackageStartupMessages(library(dplyr))
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
uniques <- df %>%
group_by(
date
) %>%
distinct(
id
) %>%
mutate(
grp = rank(id)
)
df <- df %>% left_join(
uniques
) %>% print()
#> Joining, by = c("date", "id")
#> date id grp
#> 1 1 4 2
#> 2 1 4 2
#> 3 1 2 1
#> 4 1 4 2
#> 5 2 1 1
#> 6 2 2 2
#> 7 2 3 3
#> 8 2 1 1
#> 9 3 2 2
#> 10 3 2 2
#> 11 3 1 1
#> 12 3 1 1
Created on 2020-05-08 by the reprex package (v0.3.0)
However, this seems quite inelegant and convoluted for what seems like a simple operation, so I'd rather see if other solutions are available.
Curious to see data.table solutions if available, but unfortunately the solution must be in dplyr.

We can use dense_rank
library(dplyr)
df %>%
group_by(date) %>%
mutate(grp = dense_rank(id))
# A tibble: 12 x 3
# Groups: date [3]
# date id grp
# <dbl> <dbl> <int>
# 1 1 4 2
# 2 1 4 2
# 3 1 2 1
# 4 1 4 2
# 5 2 1 1
# 6 2 2 2
# 7 2 3 3
# 8 2 1 1
# 9 3 2 2
#10 3 2 2
#11 3 1 1
#12 3 1 1
Or with frank
library(data.table)
setDT(df)[, grp := frank(id, ties.method = 'dense'), date]
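For readers who want to see what a dense rank does under the hood, the same result can be sketched in base R. This is only an illustration, assuming the df from the question; the helper name dense_rank_base is hypothetical and not part of any package.

```r
# Same data as in the question
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id   = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))

# Hypothetical helper: position of each value among the sorted unique values,
# so ties share a rank and no rank is skipped
dense_rank_base <- function(x) match(x, sort(unique(x)))

# ave() applies the helper within each date and returns a vector aligned with df
df$grp <- ave(df$id, df$date, FUN = dense_rank_base)
```

Because the ranks come from the unique values within each date, the lowest id on every date gets rank 1.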

Related

Select rows up to certain value in R

I have the following dataframe:
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
var1 = c(0, 2, 3, 4, 2, 5, 6, 10, 11, 0, 1, 2, 1, 5, 7, 10))
I want to select only the rows containing values up to 5, once 5 is reached I want it to go to the next ID and select only values up to 5 for that group so that the final result would look like this:
ID var1
1 0
1 2
1 3
1 4
1 2
1 5
2 0
2 1
2 2
2 1
2 5
I would like to try something with dplyr as it is what I am most familiar with.
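You could use which.max() to find the first occurrence of var1 >= 5, and then keep the rows up to and including it. Note that which.max() returns 1 when no value in a group satisfies the condition, so this assumes every ID reaches 5 at some point.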
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(row_number() <= which.max(var1 >= 5)) %>%
ungroup()
or
df1 %>%
group_by(ID) %>%
slice(1:which.max(var1 >= 5)) %>%
ungroup()
# # A tibble: 11 × 2
# ID var1
# <dbl> <dbl>
# 1 1 0
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 2
# 6 1 5
# 7 2 0
# 8 2 1
# 9 2 2
# 10 2 1
# 11 2 5
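If some group might never reach 5, a cumulative-sum variant may be safer than which.max(). The sketch below is an alternative, not the method from the answer above; the extra ID 3 rows are hypothetical data added only to show the edge case.

```r
library(dplyr)

# Data from the question plus a hypothetical ID 3 that never reaches 5
df1 <- data.frame(ID   = c(rep(1, 9), rep(2, 7), rep(3, 3)),
                  var1 = c(0, 2, 3, 4, 2, 5, 6, 10, 11,
                           0, 1, 2, 1, 5, 7, 10,
                           1, 2, 3))

res <- df1 %>%
  group_by(ID) %>%
  # keep a row as long as no earlier row in its group has reached 5
  filter(lag(cumsum(var1 >= 5), default = 0L) == 0) %>%
  ungroup()
```

With this version, ID 1 and ID 2 are cut after their first value of 5 or more, while ID 3 keeps all of its rows.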

Creating factor from multiple other factors fast

I have a data frame that looks like this:
df <- data.frame(
id = c(1, 2, 3, 4, 5),
generation = as.factor(c(3, 2, 4, 3, 4)),
income = as.factor(c(4, 3, 3, 7, 3)),
fem = as.factor(c(0, 0, 1, 0, 1))
)
where id is an identifier for individuals in the data set and generation, income and fem are categorical characteristics of the individuals. Now, I want to put the individuals into cohorts ("groups") based on the individual characteristics, where individuals with the exact same values for the individual characteristics should get the same cohort_id. Hence, I want the following result:
data.frame(
id = c(1, 2, 3, 4, 5),
generation = as.factor(c(3, 2, 4, 3, 4)),
income = as.factor(c(4, 3, 3, 7, 3)),
fem = as.factor(c(0, 0, 1, 0, 1)),
cohort_id = as.factor(c(1, 2, 3, 4, 3))
)
Note that id = 3 and id = 5 get the same cohort_id as they have the same characteristics.
My question is whether there is a fast way to create the cohort_ids without using multiple case_when or ifelse over and over again? This can get quite tedious if you want to build many cohorts. A solution using dplyr would be nice but is not necessary.
There are multiple ways to do this - one option is to paste the columns and match with the unique values
library(dplyr)
library(stringr)
df %>%
mutate(cohort_id = str_c(generation, income, fem),
cohort_id = match(cohort_id, unique(cohort_id)))
-output
id generation income fem cohort_id
1 1 3 4 0 1
2 2 2 3 0 2
3 3 4 3 1 3
4 4 3 7 0 4
5 5 4 3 1 3
The following code will create an index 'cohort_id' with values a little different from the provided expected, but compliant with the grouping rules:
library(dplyr)
df %>% group_by(generation, income, fem) %>%
mutate(cohort_id = cur_group_id())%>%
ungroup()
# A tibble: 5 × 5
id generation income fem cohort_id
<dbl> <fct> <fct> <fct> <int>
1 1 3 4 0 2
2 2 2 3 0 1
3 3 4 3 1 4
4 4 3 7 0 3
5 5 4 3 1 4
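A base R sketch of the paste-and-match idea is shown below. It assumes the df from the question and adds a separator when building the key, which is a precaution of this sketch rather than something the answers above require: without one, values such as "1" and "23" could be confused with "12" and "3".

```r
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1))
)

# Build one key per row; the "_" separator keeps the pieces unambiguous
key <- paste(df$generation, df$income, df$fem, sep = "_")

# Number the keys in order of first appearance
df$cohort_id <- match(key, unique(key))
```

Numbering by first appearance reproduces the cohort_id values from the expected output in the question, whereas cur_group_id() numbers the groups in sorted order.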

Generate random sequential number by group with multiple times

I'm trying to generate random numbers by group, multiple times.
For example,
> set.seed(1002)
> df<-data.frame(ID=LETTERS[seq(1:5)],num=sample(c(2,3,4), size=5, replace=TRUE))
> df
ID num
1 A 3
2 B 4
3 C 3
4 D 2
5 E 3
For each ID, I want to generate the sequence 1:num repeated (for example) 4 times, shuffled without replacement.
If ID is A, it will randomly select numbers among 1:3 4 times. So, this will be
sample(c(1,2,3,1,2,3,1,2,3),replace=FALSE)
or
rep(sample(c(1:4), replace=FALSE),times=4)
If the result is 3 2 1 2 1 3 2 3 3 1 1 2, then the data will be
ID num
1 A 3
2 A 2
3 A 2
4 A 1
5 A 1
6 A 3
7 A 2
8 A 1
9 A 3
I tried several things, like
df%>%group_by(ID)%>%mutate(random=sample(rep(1:num,times=4),replace=FALSE))
It failed, with a warning pointing at 1:num.
I also tried this.
ddply(df,.(ID),function(x) sample(rep(1:num,times=4),replace=FALSE))
The error appeared again, with NA/NaN.
I would really appreciate if you let me know how to solve this problem.
We can create a list-column and then unnest it to have separate rows.
n <- 4
library(dplyr)
df %>%
group_by(ID) %>%
mutate(num = list(sample(rep(seq_len(num), n)))) %>%
tidyr::unnest(num)
# ID num
# <fct> <int>
# 1 A 2
# 2 A 2
# 3 A 2
# 4 A 3
# 5 A 3
# 6 A 1
# 7 A 3
# 8 A 1
# 9 A 1
#10 A 3
# … with 50 more rows
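The same idea can be sketched in base R without list-columns. This is an illustration only: it assumes the num values shown in the question's printed df, and the names pieces, long and random are made up for this example.

```r
n <- 4

# num values copied from the printed data frame in the question
df <- data.frame(ID = LETTERS[1:5], num = c(3, 4, 3, 2, 3))

set.seed(1002)

# For each row, repeat 1:num n times and shuffle the result
pieces <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(ID = df$ID[i],
             random = sample(rep(seq_len(df$num[i]), times = n)))
})

# Stack the per-ID pieces into one long data frame
long <- do.call(rbind, pieces)
```

Each ID ends up with num * n rows, and every value in 1:num appears exactly n times for that ID.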
I'm not quite clear on your expected output.
The following samples num elements from 1:num with replacement, and stores samples in a list column sample.
library(tidyverse)
set.seed(2018)
df %>% mutate(sample = map(num, ~sample(1:.x, replace = TRUE)))
# ID num sample
#1 A 2 1, 1
#2 B 4 3, 4, 1, 2
#3 C 2 1, 1
#4 D 4 3, 3, 4, 4
#5 E 2 2, 2
Or if you want to repeat sampling num elements (with replacement) 4 times, you can do
set.seed(2018)
df %>%
mutate(sample = map(num, ~as.numeric(replicate(4, sample(1:.x, replace = TRUE)))))
#ID num sample
#1 A 2 1, 1, 1, 2, 1, 2, 1, 1
#2 B 4 3, 3, 4, 4, 4, 4, 4, 2, 3, 4, 3, 3, 2, 1, 1, 2
#3 C 2 1, 1, 1, 1, 1, 1, 1, 2
#4 D 4 2, 3, 2, 1, 3, 4, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1
#5 E 2 2, 1, 2, 2, 1, 1, 1, 2

Create a count consecutive variable by a group variable

I have data like this:
df<-data.frame(one=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7),
test=c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0))
I want to sum the number of consecutive 'tests' by variable 'one', but importantly they have to be consecutive. So I'd want:
dfwant<-data.frame(one=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7),
test=c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0),
want=c(2, 2, 1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 0, 0))
I got pretty close with rle but was never able to make the new want column.
An attempt in base R using ave, grouping by the one column and a cumulative sum of values that are not equal to 1 in the test column:
ave(df$test, list(df$one, cumsum(df$test != 1)), FUN=function(x) if(any(x==1)) sum(x) else x )
# [1] 2 2 1 1 1 2 2 3 3 3 1 1 1 0 0
A shortening of this logic, with a hat-tip to @RonakShah, is:
ave(df$test == 1, df$one, cumsum(df$test != 1), FUN = sum)
One option uses rleid from data.table: group by the run-length ID of 'one' and 'test' and take the sum of 'test' as 'want', then group by 'one' and set 'want' to its maximum within the group.
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(one, test))%>%
mutate(want = sum(test)) %>%
group_by(one) %>%
mutate(want = max(want)) %>%
dplyr::select(-grp)
# A tibble: 15 x 3
# Groups: one [7]
# one test want
# <dbl> <dbl> <dbl>
# 1 1 1 2
# 2 1 1 2
# 3 2 1 1
# 4 2 0 1
# 5 2 1 1
# 6 3 1 2
# 7 3 1 2
# 8 4 1 3
# 9 4 1 3
#10 4 1 3
#11 5 0 1
#12 5 1 1
#13 6 1 1
#14 7 0 0
#15 7 0 0
Or using data.table
# count the rows in each run of 1s within 'one', then take the longest run (0 if none)
setDT(df)[, want := max(tabulate(rleid(test)[test == 1]), 0L), by = one]
You can use rle to obtain the lengths of different runs with 1 and then take the maximum of those lengths
library(dplyr)
df %>%
group_by(one) %>%
mutate(want = with(rle(test == 1), max(0, lengths[values], na.rm = TRUE)))
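The rle logic can also be wrapped in a small helper and applied per group in base R. This is a sketch under the assumption that 'want' means the longest run of 1s within each value of 'one'; the name longest_run is hypothetical.

```r
df <- data.frame(one  = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7),
                 test = c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0))

# Hypothetical helper: length of the longest run of 1s, or 0 if there is none
longest_run <- function(x) {
  r <- rle(x == 1)
  max(0L, r$lengths[r$values])
}

# ave() recycles the single value returned by the helper across each group
df$want <- ave(df$test, df$one, FUN = longest_run)
```

Groups without any 1, such as one = 7, get 0 because the helper falls back to 0 when no run of 1s exists.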

How to sum a substring reference

I'm attempting to select the correct column to sum from within a data frame using ddply:
df2 <- ddply(df1,'col1', summarise, total = sum(substr(variable,1,3)))
It appears not to be working because you can't sum a character, but I am trying to pass the reference to the column, not sum the literal result of the substring. Is there a way to get around this?
Example Data & Desired output:
variable = "Aug 2017"
col1 Jun Jul Aug
1 A 1 2 3
2 A 1 2 3
3 A 1 2 3
4 A 1 2 3
5 A 1 2 3
6 B 2 3 4
7 B 2 3 4
8 B 2 3 4
9 C 3 4 5
10 C 3 4 5
Desired Output:
1 A 15
2 B 12
3 C 10
This works with dplyr instead of plyr.
# create data
df1 <- data.frame(
col1 = c(rep('A', 5), rep('B', 3), rep('C', 2)),
Jun = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
Jul = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
Aug = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5))
variable = 'Aug 2017'
# load dplyr library
library(dplyr)
# summarize each column that matches some string
df1 %>%
select(col1, matches(substr(variable, 1, 3))) %>%
group_by(col1) %>%
summarise(across(everything(), sum))
# A tibble: 3 × 2
col1 Aug
<fctr> <dbl>
1 A 15
2 B 12
3 C 10
I also highly recommend reading about nonstandard and standard evaluation, here:
http://adv-r.had.co.nz/Computing-on-the-language.html
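In current dplyr, one way to refer to a column whose name is held in a string is the .data pronoun. The sketch below assumes the example data from this answer; the names month_col, totals and total are chosen here for illustration.

```r
library(dplyr)

df1 <- data.frame(
  col1 = c(rep('A', 5), rep('B', 3), rep('C', 2)),
  Jun = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Jul = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Aug = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5))
variable <- 'Aug 2017'

# Column name derived from the string, e.g. "Aug"
month_col <- substr(variable, 1, 3)

totals <- df1 %>%
  group_by(col1) %>%
  # .data[[...]] looks the column up by its name stored in month_col
  summarise(total = sum(.data[[month_col]]))
```

This keeps all columns in the data frame and only changes which one is summed, so switching to another month only requires changing variable.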
