How to create a new variable which is the union of two other variables using dplyr?

I have this data.frame:
df = data.frame(a = c(1,1,2,2,3,3), b = c(1:6), c = c(1,2,3,5,7,8))
a b c
-----
1 1 1
1 2 2
2 3 3
2 4 5
3 5 7
3 6 8
For each value of variable a, I want to keep only a new variable d, the unique union of variables b and c:
a d
---
1 1
1 2
2 3
2 4
2 5
3 5
3 6
3 7
3 8
Something like this will of course return an error:
library(dplyr)
df %>%
group_by(a) %>%
mutate(d = union(b, c))
Does anyone have an elegant solution? Thanks!

I would suggest "data.table" for this:
library(data.table)
unique(as.data.table(df)[, list(d = unlist(.SD)), by = a])
# a d
# 1: 1 1
# 2: 1 2
# 3: 2 3
# 4: 2 4
# 5: 2 5
# 6: 3 5
# 7: 3 6
# 8: 3 7
# 9: 3 8
I suppose a similar approach in "dplyr" would be to also use "tidyr", like this:
library(dplyr)
library(tidyr)
df %>%
gather(var, d, b:c) %>%
select(-var) %>%
unique
# a d
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
# 5 3 5
# 6 3 6
# 10 2 5
# 11 3 7
# 12 3 8
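With dplyr 1.1.0 or later, the grouped union can also be written without reshaping. The following is a minimal sketch, assuming reframe() is available (it does not exist in older dplyr versions); res is only an illustrative name.

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 2, 2, 3, 3), b = 1:6, c = c(1, 2, 3, 5, 7, 8))

# mutate() needs a result of length 1 or of the group size, which is why
# union(b, c) fails there. reframe() accepts any number of rows per group.
res <- df %>%
  group_by(a) %>%
  reframe(d = union(b, c))

res
```

The result has one row per distinct value of b or c within each value of a, in the same order as the desired output in the question.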

Related

Nested list to grouped rows in R

I have the following nested list called l (dput below):
> l
$A
$A$`1`
[1] 1 2 3
$A$`2`
[1] 3 2 1
$B
$B$`1`
[1] 2 2 2
$B$`2`
[1] 3 4 3
I would like to convert this to a grouped dataframe where A and B are the first group column and 1 and 2 are the subgroups with respective values. The desired output should look like this:
group subgroup values
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3
As you can see A and B are the main group and 1 and 2 are the subgroups. Using purrr::flatten(l) or unnest doesn't work. So I was wondering if anyone knows how to convert a nested list to a grouped row dataframe?
dput of l:
l <- list(A = list(`1` = c(1, 2, 3), `2` = c(3, 2, 1)), B = list(`1` = c(2,
2, 2), `2` = c(3, 4, 3)))
Using stack and rowbind with id:
data.table::rbindlist(lapply(l, stack), idcol = "id")
# id values ind
# 1: A 1 1
# 2: A 2 1
# 3: A 3 1
# 4: A 3 2
# 5: A 2 2
# 6: A 1 2
# 7: B 2 1
# 8: B 2 1
# 9: B 2 1
# 10: B 3 2
# 11: B 4 2
# 12: B 3 2
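The same stack idea can be sketched in base R only, which also makes it easy to produce the column names asked for in the question. This is an illustration rather than a drop-in replacement; parts and res are hypothetical names.

```r
l <- list(A = list(`1` = c(1, 2, 3), `2` = c(3, 2, 1)),
          B = list(`1` = c(2, 2, 2), `2` = c(3, 4, 3)))

# stack() turns each inner list into a two-column data frame (values, ind).
parts <- lapply(l, stack)

# Attach the outer name as the group column and rename to the requested layout.
res <- do.call(rbind, Map(function(d, g) {
  data.frame(group = g, subgroup = as.character(d$ind), values = d$values,
             stringsAsFactors = FALSE)
}, parts, names(parts)))
rownames(res) <- NULL

res
```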
You can use enframe() to convert the list into a data.frame, and unnest the value column twice.
library(tidyr)
tibble::enframe(l, name = "group") %>%
unnest_longer(value, indices_to = "subgroup") %>%
unnest(value)
# A tibble: 12 × 3
group value subgroup
<chr> <dbl> <chr>
1 A 1 1
2 A 2 1
3 A 3 1
4 A 3 2
5 A 2 2
6 A 1 2
7 B 2 1
8 B 2 1
9 B 2 1
10 B 3 2
11 B 4 2
12 B 3 2
Turn the list directly into a data frame, then pivot it into a long format and arrange to your desired order.
library(tidyverse)
l %>%
as.data.frame() %>%
pivot_longer(everything(), names_to = c("group", "subgroup"),
values_to = "values",
names_pattern = "(.+?)\\.(.+?)") %>%
arrange(group, subgroup)
# A tibble: 12 × 3
group subgroup values
<chr> <chr> <dbl>
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3
You can combine rrapply with unnest, which has the benefit of working on lists of arbitrary lengths:
library(rrapply)
library(tidyr)
rrapply(l, how = "melt") |>
unnest(value)
# A tibble: 12 × 3
L1 L2 value
<chr> <chr> <dbl>
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3

R data.table group by continuous values

I need some help with grouping data by continuous values.
If I have this data.table
dt <- data.table::data.table( a = c(1,1,1,2,2,2,2,1,1,2), b = seq(1:10), c = seq(1:10)+1 )
a b c
1: 1 1 2
2: 1 2 3
3: 1 3 4
4: 2 4 5
5: 2 5 6
6: 2 6 7
7: 2 7 8
8: 1 8 9
9: 1 9 10
10: 2 10 11
I need a group for each run of consecutive equal values in column a. For each group I need the first (also the minimum possible) value of column b and the last (also the maximum possible) value of column c.
Like this:
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
Thank you very much for your help. I cannot solve it on my own.
Probably we can try
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
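What rleid(a) does here can be sketched in base R with rle(), which reports the length and value of each run of consecutive equal values. This is only an illustration using a plain data.frame copy of the data; runs, starts, ends and res are names chosen for the example.

```r
dt <- data.frame(a = c(1, 1, 1, 2, 2, 2, 2, 1, 1, 2), b = 1:10, c = 1:10 + 1)

runs   <- rle(dt$a)               # lengths and values of consecutive runs
ends   <- cumsum(runs$lengths)    # last row index of each run
starts <- ends - runs$lengths + 1 # first row index of each run

# First b and last c of every run
res <- data.frame(a = runs$values, b = dt$b[starts], c = dt$c[ends])
res
```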
An option with dplyr
library(dplyr)
dt %>%
group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
summarise(across(a:b, first), c = last(c)) %>%
select(-grp)
Output:
# A tibble: 4 × 3
a b c
<dbl> <int> <dbl>
1 1 1 4
2 2 4 8
3 1 8 10
4 2 10 11
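In dplyr 1.1.0 or later, the cumsum(c(TRUE, diff(a) != 0)) idiom can be replaced by consecutive_id(), which numbers runs of consecutive equal values. A minimal sketch, assuming that dplyr version is installed and using a plain data.frame copy of the data:

```r
library(dplyr)

dt <- data.frame(a = c(1, 1, 1, 2, 2, 2, 2, 1, 1, 2), b = 1:10, c = 1:10 + 1)

res <- dt %>%
  group_by(grp = consecutive_id(a)) %>%          # one id per run of equal a
  summarise(a = first(a), b = first(b), c = last(c)) %>%
  select(-grp)

res
```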

Expand dataframe by ID to generate a special column

I have the following dataframe
df<-data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
'A_Frequency'=c(1,2,3,4,5,1,2,3,4,5),
'B_Frequency'=c(1,2,NA,4,6,1,2,5,6,7))
The dataframe appears as follows
ID A_Frequency B_Frequency
1 A 1 1
2 A 2 2
3 A 3 NA
4 A 4 4
5 A 5 6
6 B 1 1
7 B 2 2
8 B 3 5
9 B 4 6
10 B 5 7
I wish to create a new dataframe df2 from df that looks as follows
ID CFreq
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 1
8 B 2
9 B 3
10 B 4
11 B 5
12 B 6
13 B 7
The new dataframe has a column CFreq that takes the unique values from A_Frequency and B_Frequency, grouped by ID, ignoring NA values.
I have tried dplyr but am unable to get the required result
df2<-df%>%group_by(ID)%>%select(ID, A_Frequency,B_Frequency)%>%
mutate(Cfreq=unique(A_Frequency, B_Frequency))
This yields the following which is quite different
ID A_Frequency B_Frequency Cfreq
<fct> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 NA 3
4 A 4 4 4
5 A 5 6 5
6 B 1 1 1
7 B 2 2 2
8 B 3 5 3
9 B 4 6 4
10 B 5 7 5
Could someone help me here?
The gather function from the tidyr package will be helpful here:
library(tidyverse)
df %>%
gather(x, CFreq, -ID) %>%
select(-x) %>%
na.omit() %>%
unique() %>%
arrange(ID, CFreq)
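gather() still works but is superseded in current tidyr. An equivalent sketch with pivot_longer(), assuming tidyr 1.0.0 or later; the helper column name source is made up for the example and dropped again right away.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = rep(c("A", "B"), each = 5),
                 A_Frequency = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 B_Frequency = c(1, 2, NA, 4, 6, 1, 2, 5, 6, 7))

df2 <- df %>%
  pivot_longer(-ID, names_to = "source", values_to = "CFreq",
               values_drop_na = TRUE) %>%   # drops the NA instead of na.omit()
  select(-source) %>%
  distinct() %>%
  arrange(ID, CFreq)

df2
```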
A different tidyverse possibility could be:
df %>%
nest(A_Frequency, B_Frequency, .key = C_Frequency) %>%
mutate(C_Frequency = map(C_Frequency, function(x) unique(x[!is.na(x)]))) %>%
unnest()
ID C_Frequency
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
9 A 6
10 B 1
11 B 2
12 B 3
13 B 4
14 B 5
18 B 6
19 B 7
A base R approach would be to split the dataframe by ID and, for each piece, count the number of unique entries and create a sequence based on that.
do.call(rbind, lapply(split(df, df$ID), function(x) data.frame(ID = x$ID[1] ,
CFreq = seq_len(length(unique(na.omit(unlist(x[-1]))))))))
# ID CFreq
#A.1 A 1
#A.2 A 2
#A.3 A 3
#A.4 A 4
#A.5 A 5
#A.6 A 6
#B.1 B 1
#B.2 B 2
#B.3 B 3
#B.4 B 4
#B.5 B 5
#B.6 B 6
#B.7 B 7
This will also work when A_Frequency and B_Frequency contain characters or other arbitrary numbers instead of sequential numbers.
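Note that seq_len() returns 1, 2, ..., n, which coincides with the actual values here only because each ID's values happen to be consecutive integers starting at 1. If the values themselves should be kept, a variant along these lines may be closer to the request; it is a sketch using the same split/rbind pattern, not the answerer's code.

```r
df <- data.frame(ID = rep(c("A", "B"), each = 5),
                 A_Frequency = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 B_Frequency = c(1, 2, NA, 4, 6, 1, 2, 5, 6, 7))

res <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  v <- unlist(x[-1], use.names = FALSE)  # all frequency columns as one vector
  # keep the distinct non-NA values themselves, sorted
  data.frame(ID = x$ID[1], CFreq = sort(unique(v[!is.na(v)])))
}))
rownames(res) <- NULL

res
```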
In tidyverse we can do
library(tidyverse)
df %>%
group_split(ID) %>%
map_dfr(~ data.frame(ID = .$ID[1],
CFreq= seq_len(length(unique(na.omit(flatten_chr(.[-1])))))))
A data.table option
library(data.table)
cols <- c('A_Frequency', 'B_Frequency')
out <- setDT(df)[, .(CFreq = sort(unique(unlist(.SD)))),
.SDcols = cols,
by = ID]
out
# ID CFreq
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: A 5
# 6: A 6
# 7: B 1
# 8: B 2
# 9: B 3
#10: B 4
#11: B 5
#12: B 6
#13: B 7

Count distinct values that are not the same as the current row's values

Suppose I have a data frame:
df <- data.frame(SID=sample(1:4,15,replace=T), Var1=c(rep("A",5),rep("B",5),rep("C",5)), Var2=sample(2:4,15,replace=T))
which comes out to something like this:
SID Var1 Var2
1 4 A 2
2 3 A 2
3 4 A 3
4 3 A 3
5 1 A 4
6 1 B 2
7 3 B 2
8 4 B 4
9 4 B 4
10 3 B 2
11 2 C 2
12 2 C 2
13 4 C 4
14 2 C 4
15 3 C 3
What I hope to accomplish is to find the count of unique SIDs (see below under update, this should have said count of unique (SID, Var1) combinations) where the given row's Var1 is excluded from this count and the count is grouped on Var2. So for the example above, I would like to output:
SID Var1 Var2 Count.Excluding.Var1
1 4 A 2 3
2 3 A 2 3
3 4 A 3 1
4 3 A 3 1
5 1 A 4 3
6 1 B 2 3
7 3 B 2 3
8 4 B 4 3
9 4 B 4 3
10 3 B 2 3
11 2 C 2 4
12 2 C 2 4
13 4 C 4 2
14 2 C 4 2
15 3 C 3 2
For the 1st observation, we have a count of 3 because there are 3 unique combinations of (SID, Var1) for the given Var2 value (2, in this case) where Var1 != A (Var1 value of 1st observation) -- specifically, the count includes observation 6, 7 and 11, but not 12 because we already accounted for a (SID, Var1)=(2,C) and not row 2 because we do not want Var1 to be "A". All of these rows have the same Var2 value.
I'd preferably like to use dplyr functions and the %>% operator.
UPDATE
I apologize for the confusion and my incorrect explanation above. I have corrected what I intended to ask for in the parentheses, but I am leaving my original phrasing as well because the majority of answers seem to interpret it this way.
As for the example, I apologize for not setting the seed. There seems to have been some confusion with regards to the Count.Excluding.Var1 for rows 11 and 12. With unique (SID, Var1) combinations, rows 11 and 12 should make sense as these count rows 1,2,6, and 7 xor 8.
A simple mapply can do the trick. Since the OP asked for a %>%-based solution, one option is:
df %>% mutate(Count.Excluding.Var1 =
mapply(function(x,y)nrow(unique(df[df$Var1 != x & df$Var2 == y,1:2])),.$Var1,.$Var2))
# SID Var1 Var2 Count.Excluding.Var1
# 1 4 A 2 3
# 2 2 A 3 3
# 3 4 A 4 3
# 4 4 A 4 3
# 5 3 A 4 3
# 6 4 B 3 1
# 7 3 B 3 1
# 8 3 B 3 1
# 9 4 B 2 3
# 10 2 B 3 1
# 11 2 C 2 2
# 12 4 C 4 2
# 13 1 C 4 2
# 14 1 C 2 2
# 15 3 C 4 2
Data:
The above results are based on the original data provided by the OP.
df <- data.frame(SID=sample(1:4,15,replace=T), Var1=c(rep("A",5),rep("B",5),rep("C",5)), Var2=sample(2:4,15,replace=T))
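Because the data is generated with sample() and no seed, the output above comes from a different random draw than the table in the question. As a sketch, the same idea (count distinct (SID, Var1) pairs with equal Var2 and different Var1) can be applied to the data exactly as printed in the question, here typed in by hand and written with vapply instead of mapply:

```r
df <- data.frame(
  SID  = c(4, 3, 4, 3, 1, 1, 3, 4, 4, 3, 2, 2, 4, 2, 3),
  Var1 = rep(c("A", "B", "C"), each = 5),
  Var2 = c(2, 2, 3, 3, 4, 2, 2, 4, 4, 2, 2, 2, 4, 4, 3),
  stringsAsFactors = FALSE
)

df$Count.Excluding.Var1 <- vapply(seq_len(nrow(df)), function(i) {
  # rows with the same Var2 but a different Var1 than row i
  keep <- df$Var2 == df$Var2[i] & df$Var1 != df$Var1[i]
  nrow(unique(df[keep, c("SID", "Var1")]))
}, integer(1))

df
```

On this data the new column reproduces the Count.Excluding.Var1 values requested in the question.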
I could not think of a dplyr solution, but here's one with apply:
df$Count <- apply(df, 1, function(x) length(unique(df$SID[(df$Var1 != x['Var1']) & (df$Var2 == x['Var2'])])))
# SID Var1 Var2 Count
# 1 4 A 2 3
# 2 3 A 2 3
# 3 4 A 3 1
# 4 3 A 3 1
# 5 1 A 4 2
# 6 1 B 2 3
# 7 3 B 2 3
# 8 4 B 4 3
# 9 4 B 4 3
# 10 3 B 2 3
# 11 2 C 2 3
# 12 2 C 2 3
# 13 4 C 4 2
# 14 2 C 4 2
# 15 3 C 3 2
Here is a dplyr solution, as requested. For future reference, please use set.seed so we can reproduce your desired output with sample, else I have to enter data by hand...
I think this is your logic? You want the n_distinct(SID) for each Var2, but for each row, you want to exclude rows which have the same Var1 as the current row. So a key observation here is row 3, where a simple grouped summarise would yield a count of 2. Of the rows with Var2 = 3, row 3 has SID = 4, row 4 has SID = 3, row 15 has SID = 3, but we don't count row 3 or row 4, so final count is one unique SID.
Here we get first the count of unique SID for each Var2, then the count of unique SID for each Var1, Var2 combo. First count is too large by the amount of additional unique SID for each combo, so we subtract it and add one. There is an edge case where for a Var1, there is only one corresponding Var2. This should return 0 since you exclude all the possible values of SID. I added two rows to illustrate this.
library(tidyverse)
df <- read_table2(
"SID Var1 Var2
4 A 2
3 A 2
4 A 3
3 A 3
1 A 4
1 B 2
3 B 2
4 B 4
4 B 4
3 B 2
2 C 2
2 C 2
4 C 4
2 C 4
3 C 3
1 D 5
2 D 5"
)
df %>%
group_by(Var2) %>%
mutate(SID_per_Var2 = n_distinct(SID)) %>%
group_by(Var1, Var2) %>%
mutate(SID_per_Var1Var2 = n_distinct(SID)) %>%
ungroup() %>%
add_count(Var1) %>%
add_count(Var1, Var2) %>%
mutate(
Count.Excluding.Var1 = if_else(
n > nn,
SID_per_Var2 - SID_per_Var1Var2 + 1,
0
)
) %>%
select(SID, Var1, Var2, Count.Excluding.Var1)
#> # A tibble: 17 x 4
#> SID Var1 Var2 Count.Excluding.Var1
#> <int> <chr> <int> <dbl>
#> 1 4 A 2 3.
#> 2 3 A 2 3.
#> 3 4 A 3 1.
#> 4 3 A 3 1.
#> 5 1 A 4 3.
#> 6 1 B 2 3.
#> 7 3 B 2 3.
#> 8 4 B 4 3.
#> 9 4 B 4 3.
#> 10 3 B 2 3.
#> 11 2 C 2 4.
#> 12 2 C 2 4.
#> 13 4 C 4 2.
#> 14 2 C 4 2.
#> 15 3 C 3 2.
#> 16 1 D 5 0.
#> 17 2 D 5 0.
Created on 2018-04-12 by the reprex package (v0.2.0).
Here's a solution using purrr - you can wrap this in a mutate statement if you want, but I don't know that it adds much in this particular case.
library(dplyr) # for %>%, filter() and distinct()
library(purrr)
df$Count.Excluding.Var1 = map_int(1:nrow(df), function(n) {
df %>% filter(Var2 == Var2[n], Var1 != Var1[n]) %>% distinct() %>% nrow()
})
(Updated with input from comments by Calum You. Thanks!)
A 100% tidyverse solution:
library(tidyverse) # dplyr + purrr
df %>%
group_by(Var2) %>%
mutate(count = map_int(Var1,~n_distinct(SID[.x!=Var1],Var1[.x!=Var1])))
# # A tibble: 15 x 4
# # Groups: Var2 [3]
# SID Var1 Var2 count
# <int> <chr> <int> <int>
# 1 4 A 2 3
# 2 3 A 2 3
# 3 4 A 3 1
# 4 3 A 3 1
# 5 1 A 4 3
# 6 1 B 2 3
# 7 3 B 2 3
# 8 4 B 4 3
# 9 4 B 4 3
# 10 3 B 2 3
# 11 2 C 2 4
# 12 2 C 2 4
# 13 4 C 4 2
# 14 2 C 4 2
# 15 3 C 3 2

How to mutate a column with ID in group

I have a data.frame like this:
a b c
1 a 1 1
2 a 1 2
3 a 2 3
4 b 1 4
5 b 2 5
6 b 3 6
Group by a; flag starts at 1 in each group. If b equals the previous b, flag stays the same; otherwise flag increases by 1:
a b c flag
1 a 1 1 1 <- group a start with 1
2 a 1 2 1 <-- in group a, 1(in row 2)=1(in row 1)
3 a 2 3 2 <- in group a, 2(in row 3)!=1(in row 2)
4 b 1 4 1 <- group b start with 1
5 b 2 5 2 <- in group b, 2(in row 5)!=1(in row 4)
6 b 3 6 3 <- in group b, 3(in row 6)!=2(in row 5)
I am currently using this:
x[1, 'flag'] = 1 # the first row starts at 1
for(i in 2:nrow(x)){
x[i, 'flag'] = ifelse(x[i, 'a']!=x[i-1,'a'], 1, ifelse(x[i, 'b']==x[i-1, 'b'], x[i-1, 'flag'], x[i-1,'flag']+1))
}
but it is inefficient on large datasets.
UPDATE
dense_rank in dplyr gives me the answer:
> x %>% group_by(a) %>% mutate(dense_rank(b))
Source: local data frame [10 x 4]
Groups: a
a b c dense_rank(b)
1 a x 1 1
2 a x 2 1
3 a y 3 2
4 b x 4 1
5 b y 5 2
6 b z 6 3
7 c x 7 1
8 c y 8 2
9 c z 9 3
10 c z 10 3
Thanks.
I am not entirely sure what you are trying to do. But it seems to me that you are trying to assign index numbers to values in b for each group (a or b).
#I modified your example here.
a <- rep(c("a","b"), each =3)
b <- c(4,4,5,11,12,13)
c <- 1:6
foo <- data.frame(a,b,c, stringsAsFactors = F)
a b c
1 a 4 1
2 a 4 2
3 a 5 3
4 b 11 4
5 b 12 5
6 b 13 6
#Since you referred to dplyr, I will use it.
library(dplyr)
library(data.table) # for rbindlist()
cats <- list()
for(i in unique(foo$a)){
ana <- foo %>%
filter(a == i) %>%
arrange(b) %>%
mutate(indexInb = as.integer(as.factor(b)))
cats[[i]] <- ana
}
bob <- rbindlist(cats)
a b c indexInb
1: a 4 1 1
2: a 4 2 1
3: a 5 3 2
4: b 11 4 1
5: b 12 5 2
6: b 13 6 3
Here's a quick vectorized way to solve this without using any for loops.
Base R solution using ave and transform
transform(x, flag = ave(b, a, FUN = function(x) cumsum(c(1, diff(x)))))
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
Or a data.table solution (more efficient)
library(data.table)
setDT(x)[, flag := cumsum(c(1, diff(b))), by = a]
x
# a b c flag
# 1: a 1 1 1
# 2: a 1 2 1
# 3: a 2 3 2
# 4: b 1 4 1
# 5: b 2 5 2
# 6: b 3 6 3
Or a dplyr solution (because you tagged it)
library(dplyr)
x %>%
group_by(a) %>%
mutate(flag = cumsum(c(1, diff(b))))
# Source: local data frame [6 x 4]
# Groups: a
#
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
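The cumsum(c(1, diff(b))) pattern adds the size of each step in b, so it matches the requested flag only when b is numeric and changes by 0 or 1 between rows, as in the example. A sketch that only checks whether b changed, shown on hypothetical data where b jumps by more than 1:

```r
library(dplyr)

# Hypothetical data: b jumps by more than 1 within a group
x <- data.frame(a = rep(c("a", "b"), each = 3),
                b = c(4, 4, 6, 1, 5, 5),
                c = 1:6)

res <- x %>%
  group_by(a) %>%
  # TRUE wherever b differs from the previous row of the same group
  mutate(flag = cumsum(b != lag(b, default = first(b))) + 1) %>%
  ungroup()

res
```

For group a this gives flags 1, 1, 2, whereas cumsum(c(1, diff(b))) would give 1, 1, 3. The comparison also works when b is a character column.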
