Filter ids with having count > 1 in data.table [duplicate] - r

This question already has answers here:
Select groups based on number of unique / distinct values
(4 answers)
Closed last month.
I would like to subset my data frame to keep only groups that have 3 or more observations on DIFFERENT days. I want to get rid of groups that have fewer than 3 observations, or whose observations do not fall on at least 3 different days.
Here is a sample data set:
Group Day
1 1
1 3
1 5
1 5
2 2
2 2
2 4
2 4
3 1
3 2
3 3
4 1
4 5
So for the above example, group 1 and group 3 will be kept and group 2 and 4 will be removed from the data frame.
I hope this makes sense. I imagine the solution is quite simple, but I can't work it out (I'm quite new to R and not very fast at coming up with solutions to things like this). I thought the diff function might come in handy but didn't get much further.

With data.table you could do:
library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
Group Day
1: 1 1
2: 1 3
3: 1 5
4: 1 5
5: 3 1
6: 3 2
7: 3 3
Or with dplyr:
library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)
which gives the same result.

One idea using dplyr
library(dplyr)
df %>%
  group_by(Group) %>%
  filter(length(unique(Day)) >= 3)
#Source: local data frame [7 x 2]
#Groups: Group [2]
# Group Day
# (int) (int)
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#5 3 1
#6 3 2
#7 3 3

We can use base R
i1 <- rowSums(table(df1)!=0)>=3
subset(df1, Group %in% names(i1)[i1])
# Group Day
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#9 3 1
#10 3 2
#11 3 3
Or a one-liner base R would be
df1[with(df1, as.logical(ave(Day, Group, FUN = function(x) length(unique(x)) >=3))),]
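The base R one-liner above packs several steps into one expression. Unpacked, the same idea might look like the sketch below. It rebuilds the sample data from the question (the name df and the intermediate names n_days and keep are only for illustration) and uses tapply instead of ave, so it is a variation rather than the answer's exact code.

```r
# Sample data from the question
df <- data.frame(
  Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Day   = c(1, 3, 5, 5, 2, 2, 4, 4, 1, 2, 3, 1, 5)
)

# Number of distinct days per group; the result is named by group label
n_days <- tapply(df$Day, df$Group, function(x) length(unique(x)))

# Labels of the groups observed on at least 3 different days
keep <- names(n_days)[n_days >= 3]

# Keep only the rows that belong to those groups
result <- df[as.character(df$Group) %in% keep, ]
result
```

Here n_days holds one count per group, so the threshold (3) only appears in one place and is easy to change.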

Related

Count the amount of times value A occurs without value B and vice versa

I'm having trouble figuring out how to do the opposite of the answer to this question (and in R not python).
Count the amount of times value A occurs with value B
Basically I have a dataframe with a lot of combinations of pairs of columns like so:
df <- data.frame(id1 = c("1","1","1","1","2","2","2","3","3","4","4"),
id2 = c("2","2","3","4","1","3","4","1","4","2","1"))
I want to count how often each value in column A occurs in the whole dataframe without the paired value from column B. For this small example, the result would be:
df_result <- data.frame(id1 = c("1","1","1","2","2","2","3","3","4","4"),
id2 = c("2","3","4","1","3","4","1","4","2","1"),
count = c("4","5","5","3","5","4","2","3","3","3"))
The important criterion is that the final results dataframe is collapsed by pair (in my example rows 1 and 2 are duplicates, so they are collapsed into one row holding the total number of times 1 is observed without 2). When tallying occurrences, both columns must be examined, i.e. the order of the columns doesn't matter for calculating the frequency: if column A has 1 and B has 2, this counts the same as if column A has 2 and B has 1.
I can do this very slowly by filtering for each pair, but it's not really feasible for my real data where I have many many different pairs.
Any guidance is greatly appreciated.
First, paste the two id columns together into id12 for later matching. Then use sapply to go through every row and flag the records where id1 appears in id12 but id2 does not. Sum those flags, keep only the distinct records, and finally remove the id12 column.
library(dplyr)
df %>%
  mutate(id12 = paste0(id1, id2),
         count = sapply(1:nrow(.),
                        function(x)
                          sum(grepl(id1[x], id12) & !grepl(id2[x], id12)))) %>%
  distinct() %>%
  select(-id12)
Or in base R completely:
id12 <- paste0(df$id1, df$id2)
df$count <- sapply(1:nrow(df), function(x) sum(grepl(df$id1[x], id12) & !grepl(df$id2[x], id12)))
df <- df[!duplicated(df),]
Output
id1 id2 count
1 1 2 4
2 1 3 5
3 1 4 5
4 2 1 3
5 2 3 5
6 2 4 4
7 3 1 2
8 3 4 3
9 4 2 3
10 4 1 3
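One thing to keep in mind with the grepl approach: grepl does pattern (substring) matching, so an id such as "1" would also match inside "10" or "21". That is harmless for the single-character ids in this example but could miscount with longer ids. A minimal sketch that compares ids by exact equality instead is shown below; the names pairs, has_a and has_b are just illustrative.

```r
df <- data.frame(
  id1 = c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4"),
  id2 = c("2", "2", "3", "4", "1", "3", "4", "1", "4", "2", "1"),
  stringsAsFactors = FALSE
)

# One row per distinct (id1, id2) pair
pairs <- unique(df)

# For each pair (a, b): count rows where a appears in either column
# and b appears in neither column
pairs$count <- mapply(function(a, b) {
  has_a <- df$id1 == a | df$id2 == a
  has_b <- df$id1 == b | df$id2 == b
  sum(has_a & !has_b)
}, pairs$id1, pairs$id2, USE.NAMES = FALSE)

pairs
```

This still loops over every distinct pair, so it is not faster than the sapply version. It only removes the dependence on substring matching.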
A full tidyverse version (map_int is used so that count is an integer column rather than a list column):
library(tidyverse)
df %>%
  mutate(id = paste(id1, id2),
         count = map_int(cur_group_rows(), ~ sum(str_detect(id, id1[.x]) & str_detect(id, id2[.x], negate = TRUE))))
A more efficient approach would be to work on a tabulation format:
tab = crossprod(table(rep(seq_len(nrow(df)), ncol(df)), c(df$id1, df$id2)))
#tab
#
# 1 2 3 4
# 1 7 3 2 2
# 2 3 6 1 2
# 3 2 1 4 1
# 4 2 2 1 5
So now we have the number of times each value appears with another (irrespective of their order in the two columns); the diagonal holds each id's total number of appearances. From here, we need a way to subset the above table by each pair and subtract their co-occurrence count from the total appearance count of the first id.
Make a grid of all combinations:
gr = expand.grid(id1 = colnames(tab), id2 = rownames(tab), stringsAsFactors = FALSE)
Create 2-column matrices to subset the table:
id1.ij = cbind(match(gr$id1, colnames(tab)),
match(gr$id1, rownames(tab)))
id2.ij = cbind(match(gr$id1, colnames(tab)),
match(gr$id2, rownames(tab)))
Subtract the respective values:
cbind(gr, count = tab[id1.ij] - tab[id2.ij])
# id1 id2 count
#1 1 1 0
#2 2 1 3
#3 3 1 2
#4 4 1 3
#5 1 2 4
#6 2 2 0
#7 3 2 3
#8 4 2 3
#9 1 3 5
#10 2 3 5
#11 3 3 0
#12 4 3 4
#13 1 4 5
#14 2 4 4
#15 3 4 3
#16 4 4 0
Of course, if we do not need the full grid of values, we can set:
gr = unique(df)
which results in:
# id1 id2 count
#1 1 2 4
#3 1 3 5
#4 1 4 5
#5 2 1 3
#6 2 3 5
#7 2 4 4
#8 3 1 2
#9 3 4 3
#10 4 2 3
#11 4 1 3
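To see why the crossprod(table(...)) step yields co-occurrence counts, a toy example may help. The three-row data frame below is made up purely for illustration, and the sketch assumes an id never appears in both columns of the same row.

```r
# Toy data (hypothetical): three rows, ids "a", "b", "c"
toy <- data.frame(id1 = c("a", "a", "b"),
                  id2 = c("b", "c", "c"),
                  stringsAsFactors = FALSE)

# Incidence matrix: one row per data row, one column per id,
# with a 1 wherever that id appears in that row (in either column)
inc <- table(rep(seq_len(nrow(toy)), ncol(toy)), c(toy$id1, toy$id2))

# t(inc) %*% inc: the diagonal counts rows containing each id,
# the off-diagonal counts rows containing both ids
tab <- crossprod(inc)
tab

# Rows containing "a" but not "b": total for "a" minus co-occurrences with "b"
tab["a", "a"] - tab["a", "b"]
```

Because tab is symmetric, it does not matter whether the pair is looked up as [a, b] or [b, a], which is what lets the answer ignore column order.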

R: Matching and repeating occurence [duplicate]

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
Closed 2 years ago.
(Sample code below.) I have two data sets. One is a library of products; the other contains customer id, date, viewed product, and one more detail. I want a merge that shows, for each id AND date, the whole library of products as well as where the match was. I have tried full_join, merge, and right and left joins, but they do not repeat the rows. Below is a sample of what I am trying to achieve.
id=c(1,1,1,1,2,2)
date=c(1,1,2,2,1,3)
offer=c('a','x','y','x','y','a')
section=c('general','kitchen','general','general','general','kitchen')
t=data.frame(id,date,offer,section)
offer=c('a','x','y','z')
library=data.frame(offer)
######
t table
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 2 y general
4 1 2 x general
5 2 1 y general
6 2 3 a kitchen
library table
offer
1 a
2 x
3 y
4 z
and i want to get this:
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 1 y NA
4 1 1 z general
...
(there would have to be 6*4 observations)
I realize because I match by offer it is not going to repeat the values like so, but what is another option to do that? Thanks a lot!!
You can use complete to get all combinations of library$offer for each id and date.
tidyr::complete(t, id, date, offer = library$offer)
# A tibble: 24 x 4
# id date offer section
# <dbl> <dbl> <chr> <chr>
# 1 1 1 a general
# 2 1 1 x kitchen
# 3 1 1 y NA
# 4 1 1 z NA
# 5 1 2 a NA
# 6 1 2 x general
# 7 1 2 y general
# 8 1 2 z NA
# 9 1 3 a NA
#10 1 3 x NA
# … with 14 more rows
You can also use tidyr and dplyr together to get the data. The crossing() function creates all combinations of the variables you pass in.
library(dplyr)
library(tidyr)
t %>%
select(id, date) %>%
{crossing(id=.$id, date=.$date, library)} %>%
left_join(t)
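Both answers above expand to every id crossed with every date (2 x 3 x 4 = 24 rows), including id/date pairs that never occur in the data. If only the id/date pairs that actually occur should be expanded, a base R sketch could look like this. The data frames are renamed here to views and offers (hypothetical names, chosen to avoid masking t() and library()), and the reading of "per each id AND date" as "observed pairs only" is an assumption.

```r
views <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2),
  date    = c(1, 1, 2, 2, 1, 3),
  offer   = c("a", "x", "y", "x", "y", "a"),
  section = c("general", "kitchen", "general", "general", "general", "kitchen"),
  stringsAsFactors = FALSE
)
offers <- data.frame(offer = c("a", "x", "y", "z"), stringsAsFactors = FALSE)

# Distinct id/date pairs that actually occur in the data
id_dates <- unique(views[c("id", "date")])

# by = NULL makes merge() return the Cartesian product of the two inputs
grid <- merge(id_dates, offers, by = NULL)

# Attach the observed rows; offers that were not viewed get NA in section
result <- merge(grid, views, by = c("id", "date", "offer"), all.x = TRUE)
result <- result[order(result$id, result$date, result$offer), ]
result
```

With the sample data this gives 4 observed id/date pairs times 4 offers, so 16 rows instead of 24.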

gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format within each group, and only for some of the variables (in this example, just player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
           player_id=c(1,2),
           player_decision=c(10,20),
           player_1_contribution=c(6,6),
           player_2_contribution=c(5,5),
           player_3_contribution=c(4,4)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
First make a wide data frame with one column per player's contribution (one row per group). Then join that data frame back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
wide <- pivot_wider(df, id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide)
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
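For anyone who prefers to avoid extra packages, roughly the same two steps (widen per group, then join back) can be sketched in base R with reshape() and merge(). This is an alternative technique, not the answer's method, and the column names it produces follow reshape()'s own naming convention.

```r
df <- data.frame(
  group_id            = c(rep(1, 3), rep(2, 3)),
  player_id           = rep(1:3, 2),
  player_decision     = seq(10, 60, 10),
  player_contribution = seq(6, 1, -1)
)

# Step 1: one row per group, one contribution column per player.
# With sep = "_" the new columns are named player_contribution_1, _2, _3.
wide <- reshape(df[c("group_id", "player_id", "player_contribution")],
                idvar = "group_id", timevar = "player_id",
                direction = "wide", sep = "_")

# Step 2: join the wide table back onto the per-player columns
answer <- merge(df[c("group_id", "player_id", "player_decision")],
                wide, by = "group_id")
answer
```

Every row of a group ends up carrying the contributions of all three group members, which is the layout the question asks for.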

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1:mode df2:sex
1 1
2 2
3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr, obtaining all combinations (with nonexistent ones set to 0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left_join (left_join(df1,df3)) you recover the modes not in df3, but 'sex' appears as 'NA', and the same happens in reverse if you do left_join(df2,df3).
So how can you do both left join to recover all absent combinations, with cases=0? dplyr preferred, but sqldf an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
group_by(mode, sex) %>%
summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First, here's your data in a more friendly, reproducible format:
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then I left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
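Since the goal here is a sum over every mode/sex combination, another option worth knowing is base R's xtabs(), which tabulates sums and reports 0 for empty cells. The sketch below is a different technique from the two answers above. It assumes df1 and df2 list all valid modes and sexes and uses them as factor levels, so combinations absent from df3 still show up.

```r
df1 <- data.frame(mode = 1:3)
df2 <- data.frame(sex = 1:2)
df3 <- data.frame(mode  = c(1, 1, 2, 3, 1),
                  sex   = c(1, 1, 2, 1, 2),
                  cases = c(9, 2, 7, 2, 5))

# Fix the full set of levels so empty combinations are not dropped
df3$mode <- factor(df3$mode, levels = df1$mode)
df3$sex  <- factor(df3$sex,  levels = df2$sex)

# Sum cases for every mode/sex cell; empty cells come out as 0
totals <- as.data.frame(xtabs(cases ~ mode + sex, data = df3),
                        responseName = "cases")
totals <- totals[order(totals$mode, totals$sex), ]
totals
```

Note that mode and sex come back as factors in totals, so they may need converting if numeric columns are required downstream.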

Create a rolling index of pairs over groups

I need to create (with R) a rolling index of pairs from a data set that includes groups. Consider the following data set:
times <- c(4,3,2)
V1 <- unlist(lapply(times, function(x) seq(1, x)))
df <- data.frame(group = rep(1:length(times), times = times),
V1 = V1,
rolling_index = c(1,1,2,2,3,3,4,5,5))
df
group V1 rolling_index
1 1 1 1
2 1 2 1
3 1 3 2
4 1 4 2
5 2 1 3
6 2 2 3
7 2 3 4
8 3 1 5
9 3 2 5
The data frame I have includes the variables group and V1. Within each group V1 designates a running index (that may or may not start at 1).
I want to create a new indexing variable that looks like rolling_index. This variable pairs up rows with consecutive V1 values within the same group, thus creating a new rolling index. The new index must run consecutively across groups. If a group has an odd number of rows (e.g. group 2), the last, single row gets its own rolling index value.
You can try
library(data.table)
setDT(df)[, gr:=as.numeric(gl(.N, 2, .N)), group][,
rollindex:=cumsum(c(TRUE,abs(diff(gr))>0))][,gr:= NULL]
# group V1 rolling_index rollindex
#1: 1 1 1 1
#2: 1 2 1 1
#3: 1 3 2 2
#4: 1 4 2 2
#5: 2 1 3 3
#6: 2 2 3 3
#7: 2 3 4 4
#8: 3 1 5 5
#9: 3 2 5 5
Or using base R
indx1 <- !duplicated(df$group)
indx2 <- with(df, ave(group, group, FUN=function(x)
gl(length(x), 2, length(x))))
cumsum(c(TRUE,diff(indx2)>0)|indx1)
#[1] 1 1 2 2 3 3 4 5 5
Update
The above methods are based on the 'group' column. If you already have a sequence column ('V1') within each group, as shown in the example, creating the rolling index is easier:
cumsum(!!df$V1 %%2)
#[1] 1 1 2 2 3 3 4 5 5
As mentioned in the post, if the 'V1' column does not start at 1 for some groups, we can build the within-group sequence from 'group' and then take the cumsum as above:
cumsum(!!with(df, ave(seq_along(group), group, FUN=seq_along))%%2)
#[1] 1 1 2 2 3 3 4 5 5
There is probably a simpler way but you can do:
rep_each <- unlist(mapply(function(q,r) {c(rep(2, q),rep(1, r))},
q=table(df$group)%/%2,
r=table(df$group)%%2))
df$rolling_index <- inverse.rle(x=list(lengths=rep_each, values=seq(rep_each)))
df$rolling_index
#[1] 1 1 2 2 3 3 4 5 5
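One edge case to watch with the diff-based data.table snippet near the top of these answers: it only looks at changes in the within-group pair label, so two adjacent groups that each have one or two rows would both carry the label 1 and end up sharing an index. A sketch that avoids this by keying on the group together with the pair number is shown below. The helper name rolling_pairs is made up for illustration, and this is an alternative formulation rather than any answerer's code.

```r
# Hypothetical helper: index consecutive pairs of rows within each group
rolling_pairs <- function(group) {
  # Position of each row within its group (1, 2, 3, ...)
  pos <- ave(seq_along(group), group, FUN = seq_along)
  # Pair number within the group: rows 1-2 -> 0, rows 3-4 -> 1, ...
  pair <- (pos - 1) %/% 2
  # A (group, pair) key; its order of first appearance is the rolling index
  key <- paste(group, pair)
  match(key, unique(key))
}

# Group sizes from the question: 4, 3, 2
rolling_pairs(c(1, 1, 1, 1, 2, 2, 2, 3, 3))

# Two adjacent groups of two rows each get separate indices
rolling_pairs(c(1, 1, 2, 2))
```

The function only needs the group column, so it also covers the case where V1 does not start at 1.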
