Grouping cases with at least one of three variables in common in R - r

I want to group my dataset by multiple variables and then assign an id to each group. I can id groups when I group by only one variable, using dplyr with group_indices.
But I want to group cases that have the same value on at least one of a certain set of variables, and then identify the group each case belongs to. How can I do this in R?
I have the following dataset:
NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9
I want cases to be grouped when they have at least one variable of the three I listed (name, adress, phonenumber) in common.
Cases that have the most in common with each other should be grouped together in preference to cases that have less in common.
So I want to create a grouping variable which gives cases the same value if they're in the same group.
You can assume the hierarchy of name>address>phone
NPI name adress phone org
1 1 1 1 1
2 1 1 1 1
3 2 2 2 2
4 2 3 3 2
5 3 4 4 3
6 3 4 5 3
7 4 5 6 4
8 5 6 6 4
9 6 7 7 5
10 7 8 8 6
11 1 9 9 1
In my real dataset I don't have numbers but names, actual addresses and phone numbers, so all the variables I'm working with are string variables.

Try this with dplyr:
library(dplyr)
df %>%
  arrange(name, adress, phone) %>%
  # flag a new group only when name, adress and phone all differ from the previous row
  mutate(group = c(1, ifelse((name != lag(name)) & (adress != lag(adress)) & (phone != lag(phone)), 1, 0)[-1]),
         group = cumsum(group)) %>%
  arrange(NPI)
Result:
NPI name adress phone group
1 1 1 1 1 1
2 2 1 1 1 1
3 3 2 2 2 2
4 4 2 3 3 2
5 5 3 4 4 3
6 6 3 4 5 3
7 7 4 5 6 4
8 8 5 6 6 4
9 9 6 7 7 5
10 10 7 8 8 6
11 11 1 9 9 1
Note:
This works even if name, adress, and phone are all characters. As long as the id column (NPI) is numeric, the final data.frame will be in the correct order.
Data:
df = read.table(text = " NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9 ", header = TRUE)
library(dplyr)
df = df %>% mutate_at(vars(-NPI), as.character)
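The approach above only compares each row with the row directly before it after sorting, so two cases that share, say, a phone number but do not end up next to each other would not be linked. A minimal base R sketch of a more general alternative is to treat rows as linked whenever they share a value in any of the three columns and to label the resulting connected components. The helper name group_by_any_match is hypothetical, and the sketch does not model the name > address > phone hierarchy: any shared value merges groups transitively.

```r
# Toy data copied from the question.
df <- data.frame(
  NPI    = 1:11,
  name   = c(1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 1),
  adress = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 9),
  phone  = c(1, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)
)

# Hypothetical helper (not from any package): label connected components,
# where two rows are linked if they share a value in any column in `cols`.
group_by_any_match <- function(data, cols) {
  parent <- seq_len(nrow(data))          # each row starts as its own group
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (col in cols) {
    # Row indices that share the same value in this column.
    for (rows in split(seq_len(nrow(data)), as.character(data[[col]]))) {
      root <- find_root(rows[1])
      for (r in rows[-1]) {
        other <- find_root(r)
        if (other != root) parent[other] <- root   # merge the two groups
      }
    }
  }
  roots <- vapply(seq_len(nrow(data)), find_root, integer(1))
  match(roots, unique(roots))            # number groups by first appearance
}

df$org <- group_by_any_match(df, c("name", "adress", "phone"))
df
```

Because the columns are compared as character strings, the same sketch should carry over to real names, addresses and phone numbers, provided identical entities are spelled identically.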

Related

Merge 2 rows with duplicated pair of values into a single row

I have the dataframe below, in which there are pairs of rows with the same values in columns A and B: the 3rd and 4th rows (2 3) and the 7th and 8th rows (4 6).
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
A B C D
1 1 1 5 1
2 1 2 2 2
3 2 3 5 5
4 2 3 7 3
5 3 4 7 7
6 3 5 5 5
7 4 6 7 9
8 4 6 9 6
9 5 7 7 7
10 5 8 8 0
I would like to merge these rows into one by adding the pipe | operator between values of C and D. The 2nd and 3rd line for example would be like:
A B C D
2 3 2|5 2|5
I think the combined pairs in your example are off by a row. Assuming that's the case, this is what you're looking for: we group by the columns whose duplicates we want to collapse, and then use summarize_all with paste0 to combine the remaining values with a separator.
library(tidyverse)
master %>% group_by(A, B) %>% summarize_all(~ paste0(., collapse = "|"))
A B C D
<dbl> <dbl> <chr> <chr>
1 1 1 5 1
2 1 2 2 2
3 2 3 5|7 5|3
4 3 4 7 7
5 3 5 5 5
6 4 6 7|9 9|6
7 5 7 7 7
8 5 8 8 0
We can do this in base R with aggregate
aggregate(.~ A + B, master, FUN = paste, collapse= '|')
# A B C D
#1 1 1 5 1
#2 1 2 2 2
#3 2 3 5|7 5|3
#4 3 4 7 7
#5 3 5 5 5
#6 4 6 7|9 9|6
#7 5 7 7 7
#8 5 8 8 0

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
You can add the chunk assignment as a new column with mutate. Note that calling ntile directly on id, as below, still splits id 4 across chunks 1 and 2:
library(dplyr)
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or, following Frank's comment, you could run ntile on the distinct values of id and then join the original table back on id, which keeps every id in a single chunk:
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
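To work with one chunk at a time, as the question asks, the chunk column can simply be filtered. The following is a sketch under the assumption that dplyr 1.0 or later is installed; the names chunked, chunk_2 and ids_per_chunk are illustrative only.

```r
library(dplyr)

test <- tibble(id = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
               value = 1:16)

# Assign chunks on the distinct ids, then join the full data back on id.
chunked <- test %>%
  distinct(id) %>%
  mutate(chunk = ntile(id, 3)) %>%
  right_join(test, by = "id")

# Pull out a single chunk when it is needed.
chunk_2 <- filter(chunked, chunk == 2)

# Sanity check: count how many chunks each id appears in.
ids_per_chunk <- chunked %>% distinct(id, chunk) %>% count(id)
all(ids_per_chunk$n == 1)
```

Chunks built this way are balanced by number of ids, not by number of rows, so an id with many rows makes its chunk larger.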

group cases by shared values in r [duplicate]

I have a dataset like this:
case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3
I would like to create a grouping variable.
This variable should have the same values when both x and y are the same.
I do not care what this value is, as long as it groups the cases: in my dataset, if x and y are the same for two cases they are probably part of the same organization, and I want to see which organizations there are.
So my preferred dataset would look like this:
case x y org
1 4 5 1
2 4 5 1
3 8 9 2
4 7 9 3
5 6 3 4
6 6 3 4
How would I have to program this in R?
Since you said you do not care what this value is, you can just do the following:
dt$new=as.numeric(as.factor(paste(dt$x,dt$y)))
dt
case x y new
1 1 4 5 1
2 2 4 5 1
3 3 8 9 4
4 4 7 9 3
5 5 6 3 2
6 6 6 3 2
A solution from dplyr using the group_indices.
library(dplyr)
dt2 <- dt %>%
mutate(org = group_indices(., x, y))
dt2
case x y org
1 1 4 5 1
2 2 4 5 1
3 3 8 9 4
4 4 7 9 3
5 5 6 3 2
6 6 6 3 2
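In dplyr 1.0.0 and later, passing grouping columns to group_indices is deprecated in favour of grouping first and calling cur_group_id(). A minimal sketch of the same idea under that assumption (dt3 is an illustrative name):

```r
library(dplyr)

dt <- data.frame(case = 1:6,
                 x = c(4, 4, 8, 7, 6, 6),
                 y = c(5, 5, 9, 9, 3, 3))

dt3 <- dt %>%
  group_by(x, y) %>%
  mutate(org = cur_group_id()) %>%  # one integer id per (x, y) combination
  ungroup()

dt3
```

As with group_indices, the ids follow the sorted order of the (x, y) groups, not the order in which they appear in the data.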
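If the group numbers need to follow the order in which the groups appear, we can apply rleid from the data.table package to the org column as follows. Since rleid numbers consecutive runs, this assumes that rows belonging to the same group are adjacent.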
library(dplyr)
library(data.table)
dt2 <- dt %>%
mutate(org = group_indices(., x, y)) %>%
mutate(org = rleid(org))
dt2
case x y org
1 1 4 5 1
2 2 4 5 1
3 3 8 9 2
4 4 7 9 3
5 5 6 3 4
6 6 6 3 4
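If rows of the same group are not guaranteed to be adjacent, a base R sketch that numbers groups by first appearance is to match each row's key against the unique keys. The "|" separator is an assumption: it should be a string that does not occur in the x or y values.

```r
dt <- data.frame(case = 1:6,
                 x = c(4, 4, 8, 7, 6, 6),
                 y = c(5, 5, 9, 9, 3, 3))

# Build one key per row from x and y.
key <- paste(dt$x, dt$y, sep = "|")

# Position of each key among the unique keys = group id in order of appearance.
dt$org <- match(key, unique(key))
dt
```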
Update
Here is how to arrange the rows by a column in dplyr.
library(dplyr)
dt %>%
arrange(x)
case x y
1 1 4 5
2 2 4 5
3 5 6 3
4 6 6 3
5 4 7 9
6 3 8 9
We can also do this for more than one column, such as arrange(x, y), or use desc to reverse the order, like arrange(desc(x)).
DATA
dt <- read.table(text = " case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3",
header = TRUE)

R Selecting highest count cells conditional on two columns

Apologies, if this is a duplicate please let me know, I'll gladly delete.
I am attempting to select the four highest values for different values of another column.
Dataset:
A B COUNT
1 1 2 2
2 1 3 6
3 1 4 3
4 1 5 9
5 1 6 2
6 1 7 7
7 1 8 0
8 1 9 5
9 1 10 2
10 1 11 7
11 2 1 5
12 2 3 1
13 2 4 8
14 2 5 9
15 2 6 5
16 2 7 2
17 2 8 2
18 2 9 4
19 3 1 7
20 3 2 5
21 3 4 2
22 3 5 8
23 3 6 6
24 3 7 1
25 3 8 9
26 3 9 5
27 4 1 8
28 4 2 1
29 4 3 1
30 4 5 3
31 4 6 9
For example, I would like to select four highest counts when A=1 (9,7,7,6) then when A=2 (9,8,5,5) and so on...
I would also like the corresponding B column value to be beside each count, so for when A=1 my desired output would be something like:
B A Count
5 1 9
7 1 7
11 1 7
3 1 6
I have looked at various answers on 'selecting highest values' but was struggling to find an example conditioning on other columns.
Many thanks
We can do
library(dplyr)
df1 %>%
  group_by(A) %>%
  arrange(desc(COUNT)) %>%
  filter(row_number() < 5)
Another option uses slice:
library(dplyr)
data %>% group_by(A) %>%
  arrange(A, desc(COUNT)) %>%
  slice(1:4)
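With dplyr 1.0.0 or later, the sort-then-slice pattern can also be written with slice_max. The sketch below assumes that version and uses only the A = 1 and A = 2 rows from the question; with_ties = FALSE keeps exactly four rows per group even when counts are tied.

```r
library(dplyr)

# Rows for A = 1 and A = 2 copied from the question.
df1 <- data.frame(
  A     = c(rep(1, 10), rep(2, 8)),
  B     = c(2:11, 1, 3:9),
  COUNT = c(2, 6, 3, 9, 2, 7, 0, 5, 2, 7, 5, 1, 8, 9, 5, 2, 2, 4)
)

top4 <- df1 %>%
  group_by(A) %>%
  slice_max(COUNT, n = 4, with_ties = FALSE) %>%  # four largest COUNT per A
  ungroup()

top4
```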

Removing rows on column value by ID in R

Apologies if this is posted elsewhere. I searched here and elsewhere and found things that were close but not quite what I needed. After sinking a couple of hours into this, I'm posting!
I need to remove rows from a data set for duplicate values in value1 by id. So in the following data frame I'd only want to remove row 3. I do not want to remove row 10 or row 9. If it makes a difference, in the actual data the values are dates.
I know the solution is probably very simple but I've yet to get it exactly right. Thanks!
x <- data.frame(cbind(id=c(1,2,2,2,3,3,4,5,6,6), value1=c(6,8,8,1,9,5,4,3,8,4), value2=1:10))
> x
id value1 value2
1 1 6 1
2 2 8 2
3 2 8 3
4 2 1 4
5 3 9 5
6 3 5 6
7 4 4 7
8 5 3 8
9 6 8 9
10 6 4 10
I want to end up with:
> x
id value1 value2
1 1 6 1
2 2 8 2
4 2 1 4
5 3 9 5
6 3 5 6
7 4 4 7
8 5 3 8
9 6 8 9
10 6 4 10
Try duplicated:
> x[!duplicated(x[1:2]), ]
id value1 value2
1 1 6 1
2 2 8 2
4 2 1 4
5 3 9 5
6 3 5 6
7 4 4 7
8 5 3 8
9 6 8 9
10 6 4 10
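To see why only row 3 is dropped, it can help to look at the logical vector that duplicated returns for the two columns. A small sketch using the data from the question (dup and deduped are illustrative names):

```r
x <- data.frame(id     = c(1, 2, 2, 2, 3, 3, 4, 5, 6, 6),
                value1 = c(6, 8, 8, 1, 9, 5, 4, 3, 8, 4),
                value2 = 1:10)

# TRUE where the (id, value1) pair has already occurred in an earlier row.
dup <- duplicated(x[c("id", "value1")])
which(dup)

# Keep only the first occurrence of each (id, value1) pair.
deduped <- x[!dup, ]
deduped
```

Row 9 has value1 = 8 as well, but with a different id, so its (id, value1) pair is new and it is kept.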
