Grouping similar elements together

I am trying to group similar entities together and can't find an easy way to do so.
For example, here is a table:
Names Initial_Group Final_Group
1 James,Gordon 6 A
2 James,Gordon 6 A
3 James,Gordon 6 A
4 James,Gordon 6 A
5 James,Gordon 6 A
6 James,Gordon 6 A
7 Amanda 1 A
8 Amanda 1 A
9 Amanda 1 A
10 Gordon,Amanda 5 A
11 Gordon,Amanda 5 A
12 Gordon,Amanda 5 A
13 Gordon,Amanda 5 A
14 Gordon,Amanda 5 A
15 Gordon,Amanda 5 A
16 Gordon,Amanda 5 A
17 Gordon,Amanda 5 A
18 Edward,Gordon,Amanda 4 A
19 Edward,Gordon,Amanda 4 A
20 Edward,Gordon,Amanda 4 A
21 Anna 2 B
22 Anna 2 B
23 Anna 2 B
24 Anna,Leonard 3 B
25 Anna,Leonard 3 B
26 Anna,Leonard 3 B
I am unsure how to derive the 'Final_Group' field in the table above.
To get it, any rows that share at least one name, directly or through a chain of other rows, must end up in the same group.
For example, rows 1 to 20 need to be grouped together because they are all connected by at least one shared name.
Rows 1 to 6 contain 'James,Gordon', and since 'Gordon' also appears in rows 10 to 20, they all have to be grouped. Likewise, since 'Amanda' appears in rows 7 to 9, these rows have to be grouped with 'James,Gordon', 'Gordon,Amanda', and 'Edward,Gordon,Amanda'.
Below is code to generate the initial data:
# Manually generating data
Names <- c(rep('James,Gordon',6)
,rep('Amanda',3)
,rep('Gordon,Amanda',8)
,rep('Edward,Gordon,Amanda',3)
,rep('Anna',3)
,rep('Anna,Leonard',3))
Initial_Group <- rep(1:6,c(6,3,8,3,3,3))
Final_Group <- rep(c('A','B'),c(20,6))
data <- data.frame(Names,Initial_Group,Final_Group)
# Grouping (select, mutate and group_indices come from dplyr)
library(dplyr)
data %>%
select(Names) %>%
mutate(Initial_Group=group_indices(.,Names))
Does anyone know of a way to do this in R?

This is a long one, but you could do the following:
library(tidyverse)
library(igraph)
df <- data # the question's data frame is called data
df %>%
select(Names)%>%
distinct() %>%
separate(Names, c('first', 'second'), extra = 'merge', fill = 'right')%>%
separate_rows(second) %>%
mutate(second = coalesce(second, as.character(cumsum(is.na(second)))))%>%
graph_from_data_frame()%>%
components()%>%
getElement('membership')%>%
imap(~str_detect(df$Names, .y)*.x) %>%
invoke(pmax, .)%>%
cbind(df, value = LETTERS[.], value1 = .)
Names Initial_Group Final_Group value value1
1 James,Gordon 6 A A 1
2 James,Gordon 6 A A 1
3 James,Gordon 6 A A 1
4 James,Gordon 6 A A 1
5 James,Gordon 6 A A 1
6 James,Gordon 6 A A 1
7 Amanda 1 A A 1
8 Amanda 1 A A 1
9 Amanda 1 A A 1
10 Gordon,Amanda 5 A A 1
11 Gordon,Amanda 5 A A 1
12 Gordon,Amanda 5 A A 1
13 Gordon,Amanda 5 A A 1
14 Gordon,Amanda 5 A A 1
15 Gordon,Amanda 5 A A 1
16 Gordon,Amanda 5 A A 1
17 Gordon,Amanda 5 A A 1
18 Edward,Gordon,Amanda 4 A A 1
19 Edward,Gordon,Amanda 4 A A 1
20 Edward,Gordon,Amanda 4 A A 1
21 Anna 2 B B 2
22 Anna 2 B B 2
23 Anna 2 B B 2
24 Anna,Leonard 3 B B 2
25 Anna,Leonard 3 B B 2
26 Anna,Leonard 3 B B 2
Check the column called value: it holds the group label and matches Final_Group.
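The pipeline above works by treating names as graph vertices and asking igraph for connected components. For readers who want to see that idea without igraph, here is a minimal base R sketch using a small union-find. The helper name group_by_shared_names is hypothetical (it is not part of any package), and the sketch assumes names are separated by a plain comma with no surrounding spaces.

```r
# Data from the question
Names <- c(rep("James,Gordon", 6), rep("Amanda", 3), rep("Gordon,Amanda", 8),
           rep("Edward,Gordon,Amanda", 3), rep("Anna", 3), rep("Anna,Leonard", 3))

# Hypothetical helper: returns one integer group id per row.
group_by_shared_names <- function(names_col, sep = ",") {
  parts  <- strsplit(names_col, sep, fixed = TRUE)
  people <- unique(unlist(parts))
  parent <- seq_along(people)          # each name starts as its own root

  # Follow parent links until reaching a root (an index that points to itself)
  find_root <- function(i) {
    while (parent[[i]] != i) i <- parent[[i]]
    i
  }

  # Names that appear in the same row are merged into one component
  for (p in parts) {
    roots <- vapply(match(p, people), find_root, integer(1))
    parent[roots] <- min(roots)
  }

  # A row's group is the component of its first name
  row_root <- vapply(parts, function(p) find_root(match(p[1], people)), integer(1))
  match(row_root, unique(row_root))
}

group_id    <- group_by_shared_names(Names)
Final_Group <- LETTERS[group_id]
table(Final_Group)
```

Group ids are numbered in order of first appearance, so LETTERS[group_id] only works for up to 26 groups; for more groups the integer id is the safer label.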

I misunderstood the question at first; I now assume you are focused on Final_Group. If not, please let me know.
My approach is based on the distance between samples.
library(dplyr)
library(stringr)
data <- data %>%
mutate(Names = sapply(Names, function(x) as.vector(str_split(x, ","))))
for (i in c(1:26)){
data$James[i] = ("James" %in% data$Names[[i]])
data$Gordon[i] = ("Gordon" %in% data$Names[[i]])
data$Amanda[i] = ("Amanda" %in% data$Names[[i]])
data$Edward[i] = ("Edward" %in% data$Names[[i]])
data$Anna[i] = ("Anna" %in% data$Names[[i]])
data$Leonard[i] = ("Leonard" %in% data$Names[[i]])
}
hc <- data %>% select(-Names) %>%
select(-Final_Group, -Initial_Group) %>%
dist() %>% hclust(method = "complete")
cutree(hc, k = 2) # cutree() needs either k (number of groups) or h (cut height)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
plot(hc)
The resulting clusters now match Final_Group.

Related

Use apply to create a list of adjacency matrices from dataframe in R

I have an edgelist of friendships with 5 different schools over 3 waves. I'd like to create a list for each school that contains 3 adjacency matrices (one for each wave). I can do this one by one, but I would like to use a loop or an apply function to automate it.
This is the code I have used for one school and wave:
school1_w1 <- filter(edges, school == 1 & wave == 1) %>%
graph_from_data_frame(., directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
school1_w2 <- filter(edges, school == 1 & wave == 2) %>%
graph_from_data_frame(., directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
school1_w3 <- filter(edges, school == 1 & wave == 3) %>%
graph_from_data_frame(., directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
school1 <- list(school1_w1, school1_w2, school1_w3)
How can I do this for all 5 schools with an apply or loop? Sample data below:
ego alter wave school
1 4 1 1
1 4 2 1
1 3 3 1
2 3 1 1
2 4 2 1
2 4 3 1
3 1 1 1
3 2 2 1
3 3 3 1
4 1 1 1
4 1 2 1
4 1 3 1
5 8 1 2
5 6 2 2
5 7 3 2
6 7 1 2
6 7 2 2
6 7 3 2
7 8 1 2
7 6 2 2
7 6 3 2
8 7 1 2
8 7 2 2
8 7 3 2
9 10 1 3
9 11 2 3
9 12 3 3
10 11 1 3
10 11 2 3
10 9 3 3
11 12 1 3
11 10 2 3
11 12 3 3
12 9 1 3
12 10 2 3
12 10 3 3
13 14 1 4
13 15 2 4
13 16 3 4
14 16 1 4
14 16 2 4
14 13 3 4
15 16 1 4
15 16 2 4
15 16 3 4
16 15 1 4
16 15 2 4
16 15 3 4
17 20 1 5
17 18 2 5
17 18 3 5
18 19 1 5
18 20 2 5
18 19 3 5
19 17 1 5
19 17 2 5
19 17 3 5
20 18 1 5
20 17 2 5
20 17 3 5

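We can use split + lapply, which returns a flat list with one matrix per school/wave combination (named "1.1", "2.1", and so on):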
library(igraph)
result <- lapply(split(edges, list(edges$school, edges$wave)), function(x) {
graph_from_data_frame(x, directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
})
Or with by:
result <- by(edges, list(edges$school, edges$wave), function(x) {
graph_from_data_frame(x, directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
})
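The question asks for one list per school holding one matrix per wave, whereas the calls above return a flat structure. A nested split gives that shape. The sketch below is base R only and swaps igraph for a small hypothetical helper, adjacency_from_edges, so that it runs without extra packages; it uses a few rows of the sample data and, as an assumption on my part, keeps the same vertex set for every wave of a school. With igraph, the same nesting should work by calling the answer's function in the inner lapply instead.

```r
# A few rows of the sample data (schools 1 and 2, waves 1 and 2)
edges <- data.frame(
  ego    = c(1, 1, 2, 2, 5, 5, 6, 6),
  alter  = c(4, 4, 3, 4, 8, 6, 7, 7),
  wave   = c(1, 2, 1, 2, 1, 2, 1, 2),
  school = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Hypothetical helper: directed 0/1 adjacency matrix over a fixed vertex set
adjacency_from_edges <- function(x, ids) {
  m <- matrix(0, nrow = length(ids), ncol = length(ids),
              dimnames = list(as.character(ids), as.character(ids)))
  m[cbind(match(x$ego, ids), match(x$alter, ids))] <- 1
  m
}

# Outer split by school, inner split by wave
by_school <- lapply(split(edges, edges$school), function(s) {
  ids <- sort(unique(c(s$ego, s$alter)))   # all pupils seen in this school
  lapply(split(s, s$wave), adjacency_from_edges, ids = ids)
})

by_school[["1"]][["1"]]   # school 1, wave 1
```

Each element of by_school is itself a list indexed by wave, so by_school[["2"]][["2"]] is the matrix for school 2 in wave 2.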

R: How to split a row in a dataframe into a number of rows, conditional on a value in a cell?

I have a data.frame which looks like the following:
id <- c("a","a","a","a","b","b","b","b")
age_from <- c(0,2,3,7,0,1,2,6)
age_to <- c(2,3,7,10,1,2,6,10)
y <- c(100,150,100,250,300,200,100,150)
df <- data.frame(id,age_from,age_to,y)
df$years <- df$age_to - df$age_from
Which gives a df that looks like:
id age_from age_to y years
1 a 0 2 100 2
2 a 2 3 150 1
3 a 3 7 100 4
4 a 7 10 250 3
5 b 0 1 300 1
6 b 1 2 200 1
7 b 2 6 100 4
8 b 6 10 150 4
Instead of having an unequal number of years per row, I would like to have 20 rows, 10 for each id, with each row accounting for one year. This would also involve averaging the y column across the number of years listed in the years column.
I believe this may have to be done using a loop 1:n with the n equaling a value in the years column. Although I am not sure how to start with this.

You can use rep to repeat the rows by the number of given years.
x <- df[rep(seq_len(nrow(df)), df$years),]
x
# id age_from age_to y years
#1 a 0 2 50.00000 2
#1.1 a 0 2 50.00000 2
#2 a 2 3 150.00000 1
#3 a 3 7 25.00000 4
#3.1 a 3 7 25.00000 4
#3.2 a 3 7 25.00000 4
#3.3 a 3 7 25.00000 4
#4 a 7 10 83.33333 3
#4.1 a 7 10 83.33333 3
#4.2 a 7 10 83.33333 3
#5 b 0 1 300.00000 1
#6 b 1 2 200.00000 1
#7 b 2 6 25.00000 4
#7.1 b 2 6 25.00000 4
#7.2 b 2 6 25.00000 4
#7.3 b 2 6 25.00000 4
#8 b 6 10 37.50000 4
#8.1 b 6 10 37.50000 4
#8.2 b 6 10 37.50000 4
#8.3 b 6 10 37.50000 4
If by averaging the y column across the number of years you mean dividing y by the number of years (the output above already reflects this step):
x$y <- x$y / x$years
In case age_from should go from 0 to 9 and age_to from 1 to 10 for each id:
x$age_from <- x$age_from + ave(x$age_from, x$id, x$age_from, FUN=seq_along) - 1
#x$age_from <- ave(x$age_from, x$id, FUN=seq_along) - 1 #Alternative
x$age_to <- x$age_from + 1
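As a sanity check, the three steps can be run together and the per-id total of y compared before and after the expansion. This is only a sketch: it uses the commented-out alternative for age_from, which assumes each id starts at age 0 and has no gaps between intervals.

```r
df <- data.frame(
  id       = rep(c("a", "b"), each = 4),
  age_from = c(0, 2, 3, 7, 0, 1, 2, 6),
  age_to   = c(2, 3, 7, 10, 1, 2, 6, 10),
  y        = c(100, 150, 100, 250, 300, 200, 100, 150)
)
df$years <- df$age_to - df$age_from

x <- df[rep(seq_len(nrow(df)), df$years), ]   # one row per year
x$y <- x$y / x$years                          # spread y evenly over the years
x$age_from <- ave(x$age_from, x$id, FUN = seq_along) - 1  # assumes ages start at 0
x$age_to <- x$age_from + 1
rownames(x) <- NULL

# The totals per id should be unchanged by the expansion
tapply(df$y, df$id, sum)
tapply(x$y, x$id, sum)
```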

Here is a solution with tidyr and dplyr.
First of all we complete age_from from 0 to 9 as you wanted, by keeping only the existing ids.
You will have several NAs on age_to, y and years. So, we fill them by dragging down each value in order to complete the immediately following values that are NA.
Now you can divide y by years (I assumed you meant this by setting the average value so to leave the sum consistent).
At that point, you only need to recalculate age_to accordingly.
Remember to ungroup at the end!
library(tidyr)
library(dplyr)
df %>%
complete(id, age_from = 0:9) %>%
group_by(id) %>%
fill(y, years, age_to) %>%
mutate(y = y/years) %>%
mutate(age_to = age_from + 1) %>%
ungroup()
# A tibble: 20 x 5
id age_from age_to y years
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 0 1 50 2
2 a 1 2 50 2
3 a 2 3 150 1
4 a 3 4 25 4
5 a 4 5 25 4
6 a 5 6 25 4
7 a 6 7 25 4
8 a 7 8 83.3 3
9 a 8 9 83.3 3
10 a 9 10 83.3 3
11 b 0 1 300 1
12 b 1 2 200 1
13 b 2 3 25 4
14 b 3 4 25 4
15 b 4 5 25 4
16 b 5 6 25 4
17 b 6 7 37.5 4
18 b 7 8 37.5 4
19 b 8 9 37.5 4
20 b 9 10 37.5 4

A tidyverse solution.
library(tidyverse)
df %>%
mutate(age_to = age_from + 1) %>%
group_by(id) %>%
complete(nesting(age_from = 0:9, age_to = 1:10)) %>%
fill(y, years) %>%
mutate(y = y / years)
# A tibble: 20 x 5
# Groups: id [2]
id age_from age_to y years
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 0 1 50 2
2 a 1 2 50 2
3 a 2 3 150 1
4 a 3 4 25 4
5 a 4 5 25 4
6 a 5 6 25 4
7 a 6 7 25 4
8 a 7 8 83.3 3
9 a 8 9 83.3 3
10 a 9 10 83.3 3
11 b 0 1 300 1
12 b 1 2 200 1
13 b 2 3 25 4
14 b 3 4 25 4
15 b 4 5 25 4
16 b 5 6 25 4
17 b 6 7 37.5 4
18 b 7 8 37.5 4
19 b 8 9 37.5 4
20 b 9 10 37.5 4

R Selecting highest count cells conditional on two columns

Apologies, if this is a duplicate please let me know, I'll gladly delete.
I am attempting to select the four highest values for different values of another column.
Dataset:
A B COUNT
1 1 2 2
2 1 3 6
3 1 4 3
4 1 5 9
5 1 6 2
6 1 7 7
7 1 8 0
8 1 9 5
9 1 10 2
10 1 11 7
11 2 1 5
12 2 3 1
13 2 4 8
14 2 5 9
15 2 6 5
16 2 7 2
17 2 8 2
18 2 9 4
19 3 1 7
20 3 2 5
21 3 4 2
22 3 5 8
23 3 6 6
24 3 7 1
25 3 8 9
26 3 9 5
27 4 1 8
28 4 2 1
29 4 3 1
30 4 5 3
31 4 6 9
For example, I would like to select four highest counts when A=1 (9,7,7,6) then when A=2 (9,8,5,5) and so on...
I would also like the corresponding B column value to be beside each count, so for when A=1 my desired output would be something like:
B A Count
5 1 9
7 1 7
11 1 7
3 1 6
I have looked at various answers on 'selecting highest values' but was struggling to find an example conditioning on other columns.
Many thanks

We can group by A, sort by COUNT in descending order, and keep the first four rows of each group:
df1 %>%
group_by(A) %>%
arrange(desc(COUNT)) %>%
filter(row_number() <5)

library(dplyr)
data %>% group_by(A) %>%
arrange(A, desc(COUNT)) %>%
slice(1:4)
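For comparison, the same selection can be sketched in base R with order and ave. The data below are the A = 1 and A = 2 rows from the question; the variable names ord, rank_in_group and top4 are only for illustration. Ties are kept in their original row order, which is an assumption about how you want them broken.

```r
df1 <- data.frame(
  A     = c(rep(1, 10), rep(2, 8)),
  B     = c(2:11, 1, 3:9),
  COUNT = c(2, 6, 3, 9, 2, 7, 0, 5, 2, 7, 5, 1, 8, 9, 5, 2, 2, 4)
)

# Sort by A, then by COUNT descending within A
ord <- df1[order(df1$A, -df1$COUNT), ]

# Position of each row inside its A group after sorting
rank_in_group <- ave(seq_len(nrow(ord)), ord$A, FUN = seq_along)

# Keep the first four rows of every group
top4 <- ord[rank_in_group <= 4, c("B", "A", "COUNT")]
top4
```

If tied counts at the cut-off should all be kept rather than truncated at four rows, a rank-based filter would be needed instead of a position-based one.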

Give unique identifier to consecutive groupings

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
I would like to add a column that identifies runs of consecutive numbers: for example, 1 to 7 form the first run and get 1, the second run gets 2, and so on.
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
None of the existing code I found helped me solve this issue.

Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
Or you can use dplyr as follows:
library(dplyr)
df %>% mutate(ID=cumsum(c(1, diff(UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5
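To see why the diff / cumsum one-liner works, it may help to look at the intermediate vectors. This is just an unpacked version of the same idea, with gaps and new_run as illustrative names; it assumes UID is sorted in ascending order.

```r
UID <- c(1:7, 11:13, 15, 17, 20:22)

gaps    <- diff(UID)            # distance from each value to the previous one
new_run <- c(TRUE, gaps > 1)    # the first value, and any value after a gap, starts a run
Group   <- cumsum(new_run)      # counting the run starts gives the run id

data.frame(UID, new_run, Group)
```

Every TRUE in new_run bumps the running total by one, so all values between two gaps share the same id.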

We can also get the difference between the current row and the previous row using the shift function from data.table, take the cumulative sum of the resulting logical vector, and assign it to create the 'Group' column. This is likely to be faster on large data.
library(data.table)
setDT(df1)[, Group := cumsum(UID- shift(UID, fill = UID[1])>1)+1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5

How to find difference between values in two rows in an R dataframe using dplyr

I have an R dataframe such as:
df <- data.frame(period=rep(1:4,2),
farm=c(rep('A',4),rep('B',4)),
cumVol=c(1,5,15,31,10,12,16,24),
other = 1:8);
period farm cumVol other
1 1 A 1 1
2 2 A 5 2
3 3 A 15 3
4 4 A 31 4
5 1 B 10 5
6 2 B 12 6
7 3 B 16 7
8 4 B 24 8
How do I find the change in cumVol at each farm in each period, ignoring the 'other' column? I would like a dataframe like this (optionally with the cumVol column remaining):
period farm volume other
1 1 A 0 1
2 2 A 4 2
3 3 A 10 3
4 4 A 16 4
5 1 B 0 5
6 2 B 2 6
7 3 B 4 7
8 4 B 8 8
In practice there may be many 'farm'-like columns, and many 'other'-like (ie. ignored) columns. I'd like to be able to specify all the column names using variables.
I am using the dplyr package.

In dplyr:
require(dplyr)
df %>%
group_by(farm) %>%
mutate(volume = cumVol - lag(cumVol, default = cumVol[1]))
Source: local data frame [8 x 5]
Groups: farm
period farm cumVol other volume
1 1 A 1 1 0
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 0
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8
Perhaps the desired output should actually be as follows?
df %>%
group_by(farm) %>%
mutate(volume = cumVol - lag(cumVol, default = 0))
period farm cumVol other volume
1 1 A 1 1 1
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 10
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8
Edit: Following up on your comments, I think you are looking for arrange(). If that is not the case, it might be best to start a new question.
df1 <- data.frame(period=rep(1:4,4), farm=rep(c(rep('A',4),rep('B',4)),2), crop=(c(rep('apple',8), rep('pear',8))), cumCropVol=c(1,5,15,31,10,12,16,24,11,15,25,31,20,22,26,34), other = rep(1:8,2) );
df1 %>%
arrange(desc(period), desc(farm)) %>%
group_by(period, farm) %>%
summarise(cumVol=sum(cumCropVol))
Edit: Follow up #2
df1 <- data.frame(period=rep(1:4,4), farm=rep(c(rep('A',4),rep('B',4)),2), crop=(c(rep('apple',8), rep('pear',8))), cumCropVol=c(1,5,15,31,10,12,16,24,11,15,25,31,20,22,26,34), other = rep(1:8,2) );
df <- df1 %>%
arrange(desc(period), desc(farm)) %>%
group_by(period, farm) %>%
summarise(cumVol=sum(cumCropVol))
ungroup(df) %>%
arrange(farm) %>%
group_by(farm) %>%
mutate(volume = cumVol - lag(cumVol, default = 0))
Source: local data frame [8 x 4]
Groups: farm
period farm cumVol volume
1 1 A 12 12
2 2 A 20 8
3 3 A 40 20
4 4 A 62 22
5 1 B 30 30
6 2 B 34 4
7 3 B 42 8
8 4 B 58 16

In dplyr, using diff() so you don't have to replace NAs:
library(dplyr)
df %>%
group_by(farm)%>%
mutate(volume = c(0,diff(cumVol)))
period farm cumVol other volume
1 1 A 1 1 0
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 0
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8

Would creating a new column in your original dataset be an option?
Here is an option using the data.table operator :=.
require("data.table")
DT <- data.table(df)
DT[, volume := c(0,diff(cumVol)), by="farm"]
or
diff_2 <- function(x) c(0,diff(x))
DT[, volume := diff_2(cumVol), by="farm"]
Output:
# > DT
# period farm cumVol other volume
# 1: 1 A 1 1 0
# 2: 2 A 5 2 4
# 3: 3 A 15 3 10
# 4: 4 A 31 4 16
# 5: 1 B 10 5 0
# 6: 2 B 12 6 2
# 7: 3 B 16 7 4
# 8: 4 B 24 8 8

tapply and transform?
> transform(df, volumen=unlist(tapply(cumVol, farm, function(x) c(0, diff(x)))))
period farm cumVol other volumen
A1 1 A 1 1 0
A2 2 A 5 2 4
A3 3 A 15 3 10
A4 4 A 31 4 16
B1 1 B 10 5 0
B2 2 B 12 6 2
B3 3 B 16 7 4
B4 4 B 24 8 8
ave is a better option; see @thelatemail's comment:
with(df, ave(cumVol,farm,FUN=function(x) c(0,diff(x))) )
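The question also asks for the column names to be supplied through variables. One way to sketch that on top of the ave approach is to build the grouping factor with interaction. The variable names group_cols, value_col and new_col are hypothetical, and the sketch assumes the rows are already ordered by period within each group.

```r
df <- data.frame(period = rep(1:4, 2),
                 farm   = c(rep("A", 4), rep("B", 4)),
                 cumVol = c(1, 5, 15, 31, 10, 12, 16, 24),
                 other  = 1:8)

group_cols <- "farm"     # could hold several names, e.g. c("farm", "crop")
value_col  <- "cumVol"
new_col    <- "volume"

# One grouping factor built from all grouping columns
g <- interaction(df[group_cols], drop = TRUE)

# Change since the previous row within each group; the first row gets 0
df[[new_col]] <- ave(df[[value_col]], g, FUN = function(x) c(0, diff(x)))
df
```

Columns that are not named in group_cols or value_col, such as other, are carried along untouched.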
