How to order grouped rows while keeping duplicates together [duplicate]

I have a dataframe with several "people".
There are repeat instances for "people", however, the measured "value" is different in each instance.
Here is an example data frame:
df2 <- data.frame(
value = c(1, 2, 3, 4, 5),
people = c("d", "c", "b", "d", "b")
)
which looks like:
value people
1 d
2 c
3 b
4 d
5 b
I would like to group the data by "people", order the groups by their smallest "value", and within each group sort ascending by "value".
That is, I want to keep duplicates together while sorting by value.
Here is how I would like the data to look:
value people
1 d
4 d
2 c
3 b
5 b
I have made multiple attempts with group_by and arrange from {dplyr}, but it seems I am missing something.
Thanks for the help.
I have made a change for clarity: I do not want "people" sorted alphabetically. In reality this is a schedule: person D has the first appointment (1), and his second appointment is 4, so I want D's rows to appear first and together. Person C has the 2nd appointment. Person B has the 3rd appointment, and his other appointment is 5. I hope this makes it clearer. Thanks again.

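You can use arrange in this form: sort by value first, then order the rows by each person's first appearance using match(people, unique(people)):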
library(dplyr)
df2 %>%
arrange(value) %>%
arrange(match(people, unique(people)))
# value people
#1 1 d
#2 4 d
#3 2 c
#4 3 b
#5 5 b
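To see what the second arrange is doing, here is a minimal base-R sketch of the same two steps. The names sorted, grp_rank and result are illustrative only and are not part of the answer above.

```r
# Same example data as in the question
df2 <- data.frame(
  value = c(1, 2, 3, 4, 5),
  people = c("d", "c", "b", "d", "b"),
  stringsAsFactors = FALSE
)

# Step 1: sort all rows by value
sorted <- df2[order(df2$value), ]

# Step 2: rank each person by first appearance in the sorted data.
# unique() keeps first-appearance order, here "d", "c", "b".
grp_rank <- match(sorted$people, unique(sorted$people))

# order() leaves ties in their existing order, so rows stay
# sorted by value within each person
result <- sorted[order(grp_rank), ]
result
```

Because the rank comes from first appearance rather than from the letters themselves, the approach should not depend on how the names sort alphabetically.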

This is longer, but it also works:
df2 %>% group_by(people) %>% arrange(value) %>%
mutate(d = first(value)) %>% arrange(d) %>% ungroup() %>% select(-d)
# A tibble: 5 x 2
value people
<dbl> <chr>
1 1 d
2 4 d
3 2 c
4 3 b
5 5 b

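For this particular data, the following base-R one-liner also gives your result, but only because the order of first appointments (d, c, b) happens to be reverse alphabetical: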
df2[order(df2$people, decreasing = TRUE),]
# value people
# 1 1 d
# 4 4 d
# 2 2 c
# 3 3 b
# 5 5 b
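The one-liner above sorts people in reverse alphabetical order, which the question says is not the goal. As a hedged sketch, the hypothetical df3 below (not from the question) shows a case where alphabetical order and schedule order differ, together with a base-R alternative that uses ave() to order the groups by their earliest value. It assumes that no two people share the same earliest value.

```r
# Hypothetical data: person "a" has the first appointment
df3 <- data.frame(
  value = c(1, 2, 3, 4, 5),
  people = c("a", "c", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Reverse alphabetical order puts "c" first, which is not the schedule order
alphabetical <- df3[order(df3$people, decreasing = TRUE), ]
alphabetical$people

# Earliest value per person, repeated on every row of that person
first_value <- ave(df3$value, df3$people, FUN = min)

# Order groups by their earliest value, then by value within each group
by_schedule <- df3[order(first_value, df3$value), ]
by_schedule
```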

Related

count number of combinations by group

I am struggling to count the number of unique combinations in my data. I would like to first group by id and then count how many times each combination of values occurs. Here it does not matter whether the elements are combined as d-f or f-d; they still belong to the same category because they contain the same elements:
combinations:
n
c-f: 2 # also f-c
c-d-f: 1 # also c-f-d or f-d-c
d-f: 2 # also f-d. The dash is only for visualization purposes
Dummy example:
# my data
dd <- data.frame(id = c(1,1,2,2,2,3,3,4, 4, 5,5),
cat = c('c','f','c','d','f','c','f', 'd', 'f', 'f', 'd'))
> dd
id cat
1 1 c
2 1 f
3 2 c
4 2 d
5 2 f
6 3 c
7 3 f
8 4 d
9 4 f
10 5 f
11 5 d
Using paste is a great solution provided by @benson23, but it treats f-d and d-f as different categories. I would like the order not to matter. Thank you!
Create a "combination" column in summarise; we can then count this column.
An easy way to make the order irrelevant is to sort the categories first, so that every combination is written in the same order.
library(dplyr)
dd %>%
group_by(id) %>%
arrange(id, cat) %>%
summarize(combination = paste0(cat, collapse = "-"), .groups = "drop") %>%
count(combination)
# A tibble: 3 x 2
combination n
<chr> <int>
1 c-d-f 1
2 c-f 2
3 d-f 2
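For comparison, the same sort-then-paste idea can be sketched in base R with tapply() and table(). The names combo and counts are illustrative, and the sketch assumes cat is a character column.

```r
dd <- data.frame(
  id = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
  cat = c("c", "f", "c", "d", "f", "c", "f", "d", "f", "f", "d"),
  stringsAsFactors = FALSE
)

# One label per id: sort the categories so that f-d and d-f
# both become "d-f", then paste them together
combo <- tapply(dd$cat, dd$id, function(x) paste(sort(x), collapse = "-"))

# Count how many ids share each label
counts <- table(combo)
counts
```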

How to iterate column values to find out all possible combinations in R? [duplicate]

Suppose you have a data frame with ids and the elements assigned to each id. For example:
example <- data.frame(id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c',
'd','f','g','h','a','k','l','m', 'a',
'b', 'c'))
I want to find all possible pair combinations. The main struggle here is not which R functions to use, but the logic. How can I iterate through all elements and find patterns? For instance, a was picked together with b 3 times in my sample data frame. But the original data frame has more than 30k rows, so I cannot count these combinations manually. How do I automate counting how often each pair of elements is picked together?
I was thinking about widening my df with pivot_wider and then using map_lgl to find matches. But applying map_lgl to every pair of elements would take a lot of time.
I asked nearly the same question less than a month ago; fellow users answered it, but the result is not what I need.
Do you have any ideas on how to create a data frame with all possible combinations of values for all ids?
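I understand that this code is slow, but here is another example that gets the expected output using the tidyverse package.
What I do here is first create a nested data frame by id, then produce all pair combinations for each id, unnest the data frame, and finally count the pairs. Note that vals is not sorted within each id, so a pair is counted under the order in which its elements appear (in id 4, d comes before a, so that pair is counted as d-a rather than a-d).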
library(tidyverse)
example <- data.frame(
id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c','d','f','g','h','a','k','l','m','a','b', 'c')
)
example %>%
  nest(dataset = -id) %>%
  mutate(dataset = map(dataset, function(dataset) {
    if (nrow(dataset) > 1) {
      dataset %>% .$vals %>% combn(., 2) %>% t() %>% as_tibble(.name_repair = ~c("val1", "val2")) %>% return()
    } else {
      return(NULL)
    }
  })) %>%
  unnest(cols = dataset) %>%
  group_by(val1, val2) %>%
  summarize(n = n(), .groups = "drop") %>%
  arrange(desc(n), val1, val2)
#> # A tibble: 34 x 3
#> val1 val2 n
#> <chr> <chr> <int>
#> 1 a b 3
#> 2 a c 2
#> 3 a d 2
#> 4 b c 2
#> 5 b d 2
#> 6 a e 1
#> 7 a k 1
#> 8 a l 1
#> 9 b e 1
#> 10 c d 1
#> # … with 24 more rows
Created on 2021-03-04 by the reprex package (v1.0.0)
This won't (can't) be fast for many IDs. If it is too slow, you need to parallelize or implement it in a compiled language (e.g., using Rcpp).
We sort vals, so each pair always appears in the same order. We can then create all combinations of two items grouped by ID, excluding IDs with only 1 item. Finally we tabulate the result.
library(data.table)
setDT(example)
setorder(example, id, vals)
example[, if (.N > 1) split(combn(vals, 2), 1:2), by = id][, .N, by = c("1", "2")]
# 1 2 N
# 1: a b 3
# 2: a c 2
# 3: a d 3
# 4: a e 1
# 5: b c 2
# 6: b d 2
# 7: b e 1
#<...>
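The same logic can be sketched in base R without data.table. The names pairs_by_id, pairs and pair_counts are illustrative only. Sorting vals within each id before calling combn() is what makes d-a and a-d count as the same pair, which is why a-d reaches 3 here.

```r
example <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5),
  vals = c("a", "b", "c", "d", "e", "a", "b", "d", "c",
           "d", "f", "g", "h", "a", "k", "l", "m", "a", "b", "c"),
  stringsAsFactors = FALSE
)

# For each id, build a two-column matrix of all pairs (one pair per row)
pairs_by_id <- lapply(split(example$vals, example$id), function(v) {
  if (length(v) < 2) return(NULL)  # ids with a single item have no pairs
  t(combn(sort(v), 2))             # sort first so pairs have a fixed order
})

# Stack the per-id matrices; NULL entries are ignored by rbind()
pairs <- do.call(rbind, pairs_by_id)

# Tabulate each pair as a single "x-y" label
pair_counts <- table(paste(pairs[, 1], pairs[, 2], sep = "-"))
head(sort(pair_counts, decreasing = TRUE))
```

This still enumerates every pair per id, so the cost grows quadratically with the number of items in an id, as noted above.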

R dplyr: filter common values by group

I need to find common values between different groups ideally using dplyr and R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not seem to work:
# Filter the data
dd %>%
group_by(group) %>%
filter(all(val)) # does not work
The example here solves a similar issue but has a predefined vector of shared values. What if I do not know which ones are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
"b", "b", "b",
"c", "c")
val = c(1,2,3,
3,4,5,
1,3)
dd <- data.frame(group,
val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values appear in all the groups:
dd %>%
group_by(val) %>%
filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb - n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) Whereas n_distinct(group) is using the grouped data piped in to filter, thus it gives the number of distinct groups for each value (because we group_by(val)).
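The same comparison (distinct groups per value against the total number of groups) can be sketched in base R with ave(). The names n_groups_total and groups_per_val are illustrative, and the total is pre-computed as suggested above.

```r
dd <- data.frame(
  group = c("a", "a", "a", "b", "b", "b", "c", "c"),
  val = c(1, 2, 3, 3, 4, 5, 1, 3),
  stringsAsFactors = FALSE
)

# Total number of distinct groups in the whole data (pre-computed once)
n_groups_total <- length(unique(dd$group))

# For every row, the number of distinct groups that share its val.
# ave() passes the row indices of each val to the function.
groups_per_val <- ave(
  seq_along(dd$val), dd$val,
  FUN = function(i) length(unique(dd$group[i]))
)

# Keep rows whose val occurs in every group
common <- dd[groups_per_val == n_groups_total, ]
common
```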
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
A similar option in data.table to @GregorThomas's solution is:
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

R IDs error checking (different names, same ID)

I have data with a list of people's names and their ID numbers. Not all people with the same name will have the same ID number but everyone with different names should have a different ID number. Like this:
Name ID
david 1
david 1
john 2
john 2
john 2
john 3
megan 4
bill 5
barbara 6
chris 7
chris 8
I need to make sure that these IDs are correct. So, I want to write code that says "subset only if ID numbers are the same but their names are different" (so I will be subsetting only the ID errors). I don't even know where to start with this. I tried:
df1<-df(subset(duplicated(df$Name) & duplicated(df$ID)))
Error in subset.default(duplicated(df$officer) & duplicated(df$ID)) :
argument "subset" is missing, with no default
but it didn't work and I know it doesn't tell R to match and compare names and ID numbers.
Thank you so much in advance.
Updated with the information in the comments below
Here are some test data:
> DF <- data.frame(name = c("A", "A", "A", "B", "B", "C"), id=c(1,1,2,3,4,4))
> DF
name id
1 A 1
2 A 1
3 A 2
4 B 3
5 B 4
6 C 4
So ... if I understand your problem correctly you want to get the information that there are problems with id 4 since two different names (B and C) appear for that id.
library(dplyr)
DF %>% group_by(id) %>% distinct(name) %>% tally()
# A tibble: 4 x 2
id n
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 2
Here we get a summary and see that there are two different names (n) for id 4. You can combine that with filter to only see the ids with more than one name
> DF %>% group_by(id) %>% distinct(name) %>% tally() %>% filter(n > 1)
# A tibble: 1 x 2
id n
<dbl> <int>
1 4 2
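If the goal is to subset the offending rows rather than only list the ids, a base-R sketch along the same lines might look like this. The names names_per_id, bad_ids and id_errors are illustrative only.

```r
DF <- data.frame(
  name = c("A", "A", "A", "B", "B", "C"),
  id = c(1, 1, 2, 3, 4, 4),
  stringsAsFactors = FALSE
)

# Number of distinct names attached to each id
names_per_id <- tapply(DF$name, DF$id, function(n) length(unique(n)))

# ids that are shared by more than one name
bad_ids <- as.numeric(names(names_per_id)[names_per_id > 1])

# Rows involved in an id error
id_errors <- DF[DF$id %in% bad_ids, ]
id_errors
```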
Did that help?

Product of several columns on a data frame by a vector using dplyr

I would like to multiply several columns on a dataframe by the values of a vector (all values within the same column should be multiplied by the same value, which will be different according to the column), while keeping the other columns as they are.
Since I'm using dplyr extensively I thought that it might be useful to use mutate_each function, so I can modify all columns at the same time, but I am completely lost on the syntax on the fun() part.
On the other hand, I've read this solution which is simple and works fine, but only works for all columns instead of the selected ones.
That's what I've done so far:
Imagine that I want to multiply all columns in df but letters by weight_df vector as follows:
df = data.frame(
letters = c("A", "B", "C", "D"),
col1 = c(3, 3, 2, 3),
col2 = c(2, 2, 3, 1),
col3 = c(4, 1, 1, 3)
)
> df
letters col1 col2 col3
1 A 3 2 4
2 B 3 2 1
3 C 2 3 1
4 D 3 1 3
>
weight_df = c(1:3)
If I use select before applying mutate_each, I lose the letters column (as expected), and that's not what I want (apart from the fact that the vector is applied row by row rather than column by column, and I want the opposite):
df = df %>%
select(-letters) %>%
mutate_each(funs(. * weight_df))
> df
col1 col2 col3
1 3 2 4
2 6 4 2
3 6 9 3
4 3 1 3
But if I don't select any particular columns, all values within letters are replaced by NA (which makes a lot of sense, by the way), and that's not what I want either (apart from the fact that the vector is again applied row by row rather than column by column):
df = df %>%
mutate_each(funs(. * weight_df))
> df
letters col1 col2 col3
1 NA 3 2 4
2 NA 6 4 2
3 NA 6 9 3
4 NA 3 1 3
(Please note that this is a very simple dataframe and the original one has way more rows and columns -which unfortunately are not labeled in such an easy way and no patterns can be obtained)
The problem here is that you are basically trying to operate over rows rather than columns, hence methods such as mutate_* won't work. If you are not satisfied with the many vectorized approaches proposed in the linked question, I think that, using the tidyverse (and assuming that letters is a unique identifier), one way to achieve this is to convert to long form first, multiply a single column by group, and then convert back to wide (I don't think this will be overly efficient, though):
library(tidyr)
library(dplyr)
df %>%
gather(variable, value, -letters) %>%
group_by(letters) %>%
mutate(value = value * weight_df) %>%
spread(variable, value)
#Source: local data frame [4 x 4]
#Groups: letters [4]
# letters col1 col2 col3
# * <fctr> <dbl> <dbl> <dbl>
# 1 A 3 4 12
# 2 B 3 4 3
# 3 C 2 6 3
# 4 D 3 2 9
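The row-wise behaviour described in the question comes from R's recycling rule: a shorter vector is reused element by element down each column. The sketch below illustrates this on the example data and shows one way to line the weights up with columns instead, by transposing a numeric matrix. The names recycled, m and by_column are illustrative only.

```r
df <- data.frame(
  letters = c("A", "B", "C", "D"),
  col1 = c(3, 3, 2, 3),
  col2 = c(2, 2, 3, 1),
  col3 = c(4, 1, 1, 3),
  stringsAsFactors = FALSE
)
weight_df <- 1:3

# Within one column the weights are recycled down the rows (1, 2, 3, 1),
# which is what mutate_each(funs(. * weight_df)) did to every column
recycled <- suppressWarnings(df$col1 * weight_df)
recycled

# Transposing puts the columns in rows, so recycling now matches
# weight 1 with col1, weight 2 with col2 and weight 3 with col3
m <- as.matrix(df[, c("col1", "col2", "col3")])
by_column <- t(t(m) * weight_df)
by_column
```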
Using dplyr's pipe with base R's sweep(). This selects the numeric columns only, which gives flexibility in choosing columns, and returns the new values along with all the other (non-numeric) columns:
index <- which(sapply(df, is.numeric) == TRUE)
df[,index] <- df[,index] %>% sweep(2, weight_df, FUN="*")
> df
letters col1 col2 col3
1 A 3 4 12
2 B 3 4 3
3 C 2 6 3
4 D 3 2 9
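When the columns to scale cannot be picked by type or by a naming pattern, one option is to name the weights after the columns they apply to. This is a minimal sketch under that assumption; the named vector weights is hypothetical and not part of the question.

```r
df <- data.frame(
  letters = c("A", "B", "C", "D"),
  col1 = c(3, 3, 2, 3),
  col2 = c(2, 2, 3, 1),
  col3 = c(4, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Hypothetical weights, named after the columns they apply to
weights <- c(col1 = 1, col2 = 2, col3 = 3)

# Multiply each selected column by its own weight; other columns are untouched
df[names(weights)] <- Map(`*`, df[names(weights)], weights)
df
```

Selecting by names(weights) keeps the pairing between column and weight explicit, so the order of columns in the data frame does not matter.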
Try this:
library(plyr)
library(dplyr)
df %>% select_if(is.numeric) %>% adply(., 1, function(x) x * weight_df)
