Remove groups with only one individual in R [duplicate] - r

Consider the following dataset. The data is grouped with either one or two people per group. However, an individual may have several entries.
group<-c(1,1,1,1,2,2,3,3,3,3,4,4)
individualID<-c(1,1,2,2,3,3,5,5,6,6,7,7)
X<-rbinom(12,1,0.5)
df1<-data.frame(group,individualID,X)
> df1
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 2 3 1
6 2 3 1
7 3 5 1
8 3 5 1
9 3 6 1
10 3 6 1
11 4 7 0
12 4 7 1
From the above, groups 1 and 3 have 2 individuals each, whereas groups 2 and 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
group individualID
1 1 2
2 2 1
3 3 2
4 4 1
How can I subset the data to keep only groups that have more than 1 individual, i.e. omit groups with 1 individual?
I should end up with only group 1 and group 3.

You could make a lookup table to identify the groups that have more than one unique individualID (similar to what you did with aggregate), then filter df1 based on that:
library(dplyr)
lookup <- df1 %>%
group_by(group) %>%
summarise(count = n_distinct(individualID)) %>%
filter(count > 1)
df1 %>% filter(group %in% unique(lookup$group))
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 3 5 1
6 3 5 1
7 3 6 1
8 3 6 1
Or, as @MrGumble suggests above, you could also merge df1 with lookup after creating it:
merge(df1, lookup)
group individualID X count
1 1 1 0 2
2 1 1 1 2
3 1 2 1 2
4 1 2 1 2
5 3 6 1 2
6 3 6 1 2
7 3 5 1 2
8 3 5 1 2
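If the extra count column that merge adds is unwanted, a filtering join keeps only the columns of df1. This is a minimal sketch, assuming dplyr is installed; X is fixed here (an assumption, since the question draws it at random) so that the result is reproducible:

```r
library(dplyr)

# Same grouping structure as the question; X is fixed for reproducibility (assumption)
df1 <- data.frame(
  group        = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  individualID = c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7),
  X            = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1)
)

# Groups with more than one distinct individual
lookup <- df1 %>%
  group_by(group) %>%
  summarise(count = n_distinct(individualID)) %>%
  filter(count > 1)

# semi_join() keeps the rows of df1 that have a match in lookup,
# without adding any of lookup's columns
kept <- semi_join(df1, lookup, by = "group")
kept
```

Here kept should contain the 8 rows of groups 1 and 3 with the original three columns only.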

Related

R left_join() replacing joined values rather than adding in new columns

I have the following dataframes:
A<-data.frame(AgentNo=c(1,2,3,4,5,6),
N=c(2,5,6,1,9,0),
Rarity=c(1,2,1,1,2,2))
AgentNo N Rarity
1 1 2 1
2 2 5 2
3 3 6 1
4 4 1 1
5 5 9 2
6 6 0 2
B<-data.frame(Rank=c(1,5),
AgentNo.x=c(2,5),
AgentNo.y=c(1,4),
N=c(3,1),
Rarity=c(1,2))
Rank AgentNo.x AgentNo.y N Rarity
1 1 2 1 3 1
2 5 5 4 1 2
I would like to left join B onto A by columns "AgentNo"="AgentNo.y" and "N"="N", but rather than adding new columns to A from B, I want the same columns as A, with the values in joined rows updated from B.
For any joined rows I want A.AgentNo to now be B.AgentNo.x, A.N to be B.N and A.Rarity to be B.Rarity. I would like to drop B.Rank and B.AgentNo.y completely.
The result should be:
Result<-data.frame(AgentNo=c(2,2,3,5,5,6), N=c(3,5,6,1,9,0), Rarity=c(1,2,1,2,2,2))
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
After some data wrangling, you can use rows_update to update the rows of A by the values of B:
library(dplyr)
A <- A %>%
mutate(AgentNo.y = AgentNo)
B <- select(B, AgentNo = AgentNo.x, AgentNo.y, N, Rarity)
rows_update(A, B, by = "AgentNo.y") %>%
select(-AgentNo.y)
output
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
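For comparison, the same update can be sketched in base R with match(). Like the rows_update answer above, this matches rows on AgentNo = AgentNo.y only and ignores N as a key; the names idx, ok and updated are hypothetical helpers introduced for this illustration:

```r
A <- data.frame(AgentNo = c(1, 2, 3, 4, 5, 6),
                N       = c(2, 5, 6, 1, 9, 0),
                Rarity  = c(1, 2, 1, 1, 2, 2))
B <- data.frame(Rank      = c(1, 5),
                AgentNo.x = c(2, 5),
                AgentNo.y = c(1, 4),
                N         = c(3, 1),
                Rarity    = c(1, 2))

# For each row of B, find the row of A whose AgentNo equals B's AgentNo.y
idx <- match(B$AgentNo.y, A$AgentNo)
ok  <- !is.na(idx)  # skip rows of B without a partner in A

# Overwrite the matched rows of A with the values from B
updated <- A
updated$AgentNo[idx[ok]] <- B$AgentNo.x[ok]
updated$N[idx[ok]]       <- B$N[ok]
updated$Rarity[idx[ok]]  <- B$Rarity[ok]
updated
```

Rows 1 and 4 of A are replaced by the values from B; all other rows are left untouched, and Rank and AgentNo.y never enter the result.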

Remove groups with only one individual in R without using dplyr package [duplicate]

Consider the following dataset. The data is grouped with either one or two people per group. However, an individual may have several entries.
group<-c(1,1,1,1,2,2,3,3,3,3,4,4)
individualID<-c(1,1,2,2,3,3,5,5,6,6,7,7)
X<-rbinom(12,1,0.5)
df1<-data.frame(group,individualID,X)
> df1
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 2 3 1
6 2 3 1
7 3 5 1
8 3 5 1
9 3 6 1
10 3 6 1
11 4 7 0
12 4 7 1
From the above, groups 1 and 3 have 2 individuals each, whereas groups 2 and 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
group individualID
1 1 2
2 2 1
3 3 2
4 4 1
How can I subset the data, without using the dplyr package, to keep only groups that have more than 1 individual, i.e. omit groups with 1 individual?
I should end up with only group 1 and group 3.
Another option is with tidyverse: after grouping by 'group', filter the rows where the number of distinct elements (n_distinct) in 'individualID' is greater than 1
library(dplyr)
df1 %>%
group_by(group) %>%
filter(n_distinct(individualID) > 1) %>%
ungroup()
# A tibble: 8 × 3
group individualID X
<dbl> <dbl> <int>
1 1 1 0
2 1 1 0
3 1 2 1
4 1 2 1
5 3 5 0
6 3 5 0
7 3 6 1
8 3 6 0
Or with subset and ave from base R
subset(df1, ave(individualID, group, FUN = function(x) length(unique(x))) > 1)
group individualID X
1 1 1 0
2 1 1 0
3 1 2 1
4 1 2 1
7 3 5 0
8 3 5 0
9 3 6 1
10 3 6 0
There are more concise ways for sure, but here is the general idea.
# use your code to get the counts by group
df1_counts <- aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
# create a vector of groups where the count is > 1
keep_groups <- df1_counts$group[df1_counts$individualID > 1]
# filter the rows to only groups you want to keep
df1[df1$group %in% keep_groups,]
# group individualID X
# 1 1 1 0
# 2 1 1 0
# 3 1 2 1
# 4 1 2 0
# 7 3 5 1
# 8 3 5 1
# 9 3 6 0
# 10 3 6 1
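The X values differ between the outputs above because rbinom() draws them at random on every run. A small sketch, assuming an arbitrary seed of 42, makes the example reproducible and checks which groups survive the base R filter:

```r
set.seed(42)  # arbitrary seed (assumption); any fixed value gives repeatable X

group        <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
individualID <- c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7)
X            <- rbinom(12, 1, 0.5)
df1          <- data.frame(group, individualID, X)

# ave() repeats the per-group count of distinct individuals on every row,
# so the comparison yields one TRUE/FALSE per row for subset()
kept <- subset(df1, ave(individualID, group,
                        FUN = function(x) length(unique(x))) > 1)

unique(kept$group)  # groups 1 and 3 remain
```

The row filter depends only on group and individualID, so the kept rows are the same whatever values X takes.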

How do you duplicate rows n times by group and change one specific column value in R?

I am trying to create duplicate rows by group. The number of duplicate rows I want to create varies by group and I want to fix the value of one column Attended = 0.
A minimal working example of the data set DF I am working with is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
2 5 1 2
2 5 1 4
3 7 0 1
For the example above, suppose I want every person (ID) to have 5 rows, with Demo the same across all rows for each individual. Thus, I have to create 1 row for ID = 1, 3 for ID = 2 and 4 for ID = 3 (I would like to calculate these dynamically for each subgroup). For the new rows I generate I want Attended = 0 and t to take on the value of a missing index, so that the final output is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
1 3 0 2
2 5 1 2
2 5 1 4
2 5 0 1
2 5 0 3
2 5 0 5
3 7 0 1
3 7 0 2
3 7 0 3
3 7 0 4
3 7 0 5
I have been able to create duplicate rows by group, but haven't been able to figure out how to create different number of duplicates by participant and correctly fill in the index column t.
Here is what I have working:
DF %>%
group_by(ID) %>%
rbind(., mutate(., t = row_number()))
I have been trying to create the right number of duplicates using slice() and trying to get the t value to be exactly what I want but to no avail.
Any help would be appreciated!
One tidyverse possibility could be:
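library(dplyr)
library(tidyr)
# nesting(ID, Demo) keeps Demo attached to its ID so the new rows inherit it
DF %>%
complete(t, nesting(ID, Demo), fill = list(Attended = 0)) %>%
arrange(ID)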
t ID Demo Attended
<int> <int> <int> <dbl>
1 1 1 3 1
2 2 1 3 0
3 3 1 3 1
4 4 1 3 0
5 5 1 3 1
6 1 2 5 0
7 2 2 5 1
8 3 2 5 0
9 4 2 5 1
10 5 2 5 0
11 1 3 7 0
12 2 3 7 0
13 3 3 7 0
14 4 3 7 0
15 5 3 7 0
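The same padding can be sketched in base R, which also covers the case where some value of t never occurs in the data. This is only an illustration: the target of 5 rows per ID comes from the question, while n_rows, ids, grid and out are hypothetical names:

```r
DF <- data.frame(ID       = c(1, 1, 1, 1, 2, 2, 3),
                 Demo     = c(3, 3, 3, 3, 5, 5, 7),
                 Attended = c(1, 1, 0, 1, 1, 1, 0),
                 t        = c(1, 3, 4, 5, 2, 4, 1))

n_rows <- 5  # desired number of rows per ID (from the question)

# One row per person, then a cross join with every index 1..n_rows
ids  <- unique(DF[c("ID", "Demo")])
grid <- merge(ids, data.frame(t = as.numeric(seq_len(n_rows))), by = NULL)

# Left join the observed rows onto the full grid; missing rows get Attended = NA
out <- merge(grid, DF, by = c("ID", "Demo", "t"), all.x = TRUE)
out$Attended[is.na(out$Attended)] <- 0

# Order by person and index
out <- out[order(out$ID, out$t), ]
out
```

Each ID ends up with exactly n_rows rows, and only the rows that were present in DF can carry Attended = 1.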

R: Assign incremental ids based on the groups [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 3 years ago.
I have the following sample data frame:
> test = data.frame(UserId = sample(1:5, 10, replace = T)) %>% arrange(UserId)
> test
UserId
1 1
2 1
3 1
4 1
5 1
6 3
7 4
8 4
9 4
10 5
I now want another column called loginCount for that user, which is something like assigning incremental ids within each group, something like below. Using the mutate like below creates id within each group, but how do I get the incremental ids within each group independent of each other ?
> test %>% mutate(loginCount = group_indices_(test, .dots = "UserId"))
UserId loginCount
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 3 2
7 4 3
8 4 3
9 4 3
10 5 4
I want something like shown below:
UserId loginCount
1 1
1 2
1 3
1 4
1 5
3 1
4 1
4 2
4 3
5 1
You could group and use row_number:
test %>%
arrange(UserId) %>%
group_by(UserId) %>%
mutate(loginCount = row_number()) %>%
ungroup()
# A tibble: 10 x 2
UserId loginCount
<int> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
One solution using base R tapply()
test$loginCount <- unlist(tapply(rep(1, nrow(test)), test$UserId, cumsum))
> test
UserId loginCount
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
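Note that the tapply() approach relies on test being sorted by UserId, because unlist() returns the counts group by group. A base R sketch with ave() and seq_along() numbers the rows within each group in place, so it should also work on unsorted data (shuffled is a hypothetical example frame):

```r
test <- data.frame(UserId = c(1, 1, 1, 1, 1, 3, 4, 4, 4, 5))

# ave() applies seq_along() within each UserId and puts the results
# back at the original row positions
test$loginCount <- ave(seq_along(test$UserId), test$UserId, FUN = seq_along)
test$loginCount
# 1 2 3 4 5 1 1 2 3 1

# Unsorted input: counts still follow the order of appearance per user
shuffled <- data.frame(UserId = c(4, 1, 4, 1))
shuffled$loginCount <- ave(seq_along(shuffled$UserId), shuffled$UserId,
                           FUN = seq_along)
shuffled$loginCount
# 1 1 2 2
```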

Get columns in frame based on values in second frame

I have 2 data frames. One has an ID column with a lot of ordered IDs.
The other one has just specific rows of the first data frame. Those are my markers.
I need to get the sum of the values in a specific column based on the id values of the second data frame.
The first data frame may be
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
the second one:
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
What I need to get:
id goals cards group points
1 2 2 1 2-(2+2)
2 3 2 1 0 cause in second list
3 4 2 1 4-(2+1+2)
4 5 1 1 5-(1+2)
5 1 2 1 0 cause in second list
1 2 2 2 2-(2+2)
2 3 2 2 0
3 4 2 2 0
4 5 1 3 5-(1+2)
5 1 2 3 0
Something like: ??
df1<- df1%>%
rowwise() %>%
mutate(points=
goals
-(sum( df1$cards[df1$id <= df2$id & df1$id>df1$id])))
df1 = read.table(text = "
id goals cards
1 2 2
2 3 2
3 4 2
4 5 1
5 1 2
", header=T)
df2 = read.table(text = "
id goals cards
2 3 2
5 1 2
", header=T)
library(dplyr)
# function that gets an id and returns the sum of cards based on df2
GetSumOfCards = function(x) {
ids = min(df2$id[df2$id >= x]) # for a given id of df1 find the minimum id in df2 that is greater than or equal to this id
ifelse(x %in% df2$id, # if the given id exists in df2
0, # sum of cards is zero
sum(df1$cards[df1$id >= x & df1$id <= ids])) # otherwise get sum of cards in df1 from this id until the id obtained before
}
# update function to be vectorised
GetSumOfCards = Vectorize(GetSumOfCards)
df1 %>%
mutate(sum_cards = GetSumOfCards(id), # get sum of cards for each id using the function
points = goals - sum_cards) # get the points
# id goals cards sum_cards points
# 1 1 2 2 4 -2
# 2 2 3 2 0 3
# 3 3 4 2 5 -1
# 4 4 5 1 3 2
# 5 5 1 2 0 1
Based on your updated question, applying a similar function to every row makes the process very slow. So, this solution groups data in a way that you can just count the cards on chunks of data/rows:
df1 = read.table(text = "
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
", header=T)
df2 = read.table(text = "
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
", header=T)
library(dplyr)
df1 %>%
arrange(group, desc(id)) %>% # order by group and id descending (this will help with counting the cards)
left_join(df2 %>% # join specific columns of df2 and add a flag to know that this row exists in df2
select(id, group) %>%
mutate(flag = 1), by=c("id","group")) %>%
mutate(flag = ifelse(is.na(flag), 0, flag), # replace NA with 0
flag2 = cumsum(flag)) %>% # this flag will create the groups we need to count cards
group_by(group, flag2) %>% # for each new group (we need both as the card counting will change when we have a row from df2, or if group changes)
mutate(sum_cards = ifelse(flag == 1, 0, cumsum(cards))) %>% # get cumulative sum of cards unless the flag = 1, where we need 0 cards
ungroup() %>% # forget the grouping
arrange(group, id) %>% # back to original order
mutate(points = goals - sum_cards) %>% # calculate points
select(-flag, -flag2) # remove flags
# # A tibble: 10 x 6
# id goals cards group sum_cards points
# <int> <int> <int> <int> <dbl> <dbl>
# 1 1 2 2 1 4 -2
# 2 2 3 2 1 0 3
# 3 3 4 2 1 5 -1
# 4 4 5 1 1 3 2
# 5 5 1 2 1 0 1
# 6 1 2 2 2 4 -2
# 7 2 3 2 2 0 3
# 8 3 4 2 2 0 4
# 9 4 5 1 3 3 2
# 10 5 1 2 3 0 1
