How to do subset in R with aggregation? - r

I have a large data frame (200000 rows) with 2 columns: group_id and user_id. One user can belong to multiple groups. I need a result dataframe with group_id, user_id of all users who belong to more than 3 groups.
group_id user_id
100 1
101 1
102 1
103 1
101 2
103 2
In the above example, in the resultant data frame I will get the first 4 rows only.
df <- structure(list(group_id = c(100L, 101L, 102L, 103L, 101L, 103L
), user_id = c(1L, 1L, 1L, 1L, 2L, 2L)), .Names = c("group_id",
"user_id"), class = "data.frame", row.names = c(NA, -6L))

The "data.table" package makes this simple. If df is the original data frame, you can do
library(data.table)
setDT(df)[, .SD[.N > 3], by = user_id]
# user_id group_id
# 1: 1 100
# 2: 1 101
# 3: 1 102
# 4: 1 103
.N is the number of rows in each group (here, each user_id), and .SD holds that group's rows with all the non-grouping columns. So .SD[.N > 3] keeps the rows of every group that has more than three rows.
Note: If you don't want to change the original df to a data table, you can use as.data.table() in place of setDT(). However, this will make a copy of df.
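A minimal sketch of the copy-based variant mentioned in the note, assuming the data.table package is installed. The name dt is hypothetical, and the if (.N > 3L) .SD form is a swapped-in idiom that should return the same rows as .SD[.N > 3]:

```r
library(data.table)

df <- data.frame(
  group_id = c(100L, 101L, 102L, 103L, 101L, 103L),
  user_id  = c(1L, 1L, 1L, 1L, 2L, 2L)
)

# as.data.table() copies df, so the original stays a plain data frame
dt <- as.data.table(df)

# return a group's rows only when that group has more than 3 rows;
# groups for which the expression yields NULL are dropped
res <- dt[, if (.N > 3L) .SD, by = user_id]
```

Because df is never passed to setDT(), it can still be used with code that expects an ordinary data frame.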

Here's a dplyr solution, but having seen @Richard's I know there's a better dplyr way too:
library(dplyr)
df %>%
count(user_id) %>%
filter(n > 3) %>%
select(user_id) %>%
inner_join(df, .)
## Joining by: "user_id"
## group_id user_id
## 1 100 1
## 2 101 1
## 3 102 1
## 4 103 1
Using @Richard's comment:
df %>%
group_by(user_id) %>%
filter(n() > 3)

Assuming that 'group_id' is unique per each 'user_id', an option using base R would be
df[with(df, ave(user_id, user_id, FUN=length)>3),]
# group_id user_id
#1 100 1
#2 101 1
#3 102 1
#4 103 1
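To see what the ave() call is doing, here is a sketch that spells out the intermediate vector. The names n_rows and n_distinct are hypothetical, and the second count is only an assumption about how one might drop the uniqueness requirement:

```r
df <- data.frame(
  group_id = c(100L, 101L, 102L, 103L, 101L, 103L),
  user_id  = c(1L, 1L, 1L, 1L, 2L, 2L)
)

# for every row, the number of rows that share its user_id
n_rows <- ave(df$user_id, df$user_id, FUN = length)
res <- df[n_rows > 3, ]

# if a (group_id, user_id) pair could repeat, count distinct groups instead
n_distinct <- ave(df$group_id, df$user_id,
                  FUN = function(g) length(unique(g)))
res_distinct <- df[n_distinct > 3, ]
```

On the sample data both versions keep the same four rows, since no pair is repeated.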

Related

R - create indicator column for whether a value appears within a group

I have a dataframe df with a set of IDs that may appear multiple times with a different Status for each row. I need to create a 0/1 indicator column for whether Status "B" ever appears for that ID. B_appears shows my desired result.
I have done something kind of related by creating a "Count" column that counts the number of times the Status listed in that row appears for that ID. But I can't figure out how to create the indicator variable that is specifically related to Status "B."
This is how I created the "Count" column, fwiw.
df <- ddply(df, .(ID, Status), transform, Count = length(ID))
Thanks in advance!
ID Status Count B_appears
1 A 1 0
2 A 1 1
2 B 2 1
2 B 2 1
3 A 1 1
3 B 1 1
With tidyverse, we group by 'ID', get the Count column as the group size (n()), and create 'B_appears' from a logical value that checks whether 'B' is %in% the Status, converted to binary (with + or as.integer)
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(Count = n(),
B_appears = +('B' %in% Status)) %>%
# or may also create B_appears as
# B_appears = +(any(Status %in% 'B'))) %>%
ungroup
-output
# A tibble: 6 × 4
ID Status Count B_appears
<int> <chr> <int> <int>
1 1 A 1 0
2 2 A 3 1
3 2 B 3 1
4 2 B 3 1
5 3 A 2 1
6 3 B 2 1
data
df <- structure(list(ID = c(1L, 2L, 2L, 2L, 3L, 3L), Status = c("A",
"A", "B", "B", "A", "B")), row.names = c(NA, -6L), class = "data.frame")

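For comparison, a base R sketch of the same idea using ave(). It assumes the sample data above; here Count is computed per ID and Status combination, as in the question's table, rather than per ID:

```r
df <- data.frame(
  ID     = c(1L, 2L, 2L, 2L, 3L, 3L),
  Status = c("A", "A", "B", "B", "A", "B")
)

# rows per ID and Status combination, as in the question's Count column
df$Count <- ave(seq_along(df$ID), df$ID, df$Status, FUN = length)

# 1 if any row of that ID has Status "B", otherwise 0
df$B_appears <- ave(as.integer(df$Status == "B"), df$ID, FUN = max)
```

Taking the group-wise max of a 0/1 vector is one way to express "appears at least once" without an explicit any().
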
Summarizing a dataframe in R with multiple functions in place?

I am new to R and trying to summarize a dataframe with multiple functions and I would like the result to appear in the same column, rather than in separated columns for each function. For example, my data set looks something like this
data =
A B
----
1 2
2 2
3 2
4 2
When I call summarize_all(data, c(min, max)), the dataframe becomes
a_fn1 b_fn1 a_fn2 b_fn2
1 2 4 2
How can I make it so that the result of the summarize_all becomes this:
A B
----
1 2
4 2
Thanks
Does this work:
library(dplyr)
bind_rows(apply(data,2,min),apply(data,2,max))
# A tibble: 2 x 2
A B
<dbl> <dbl>
1 1 2
2 4 2
Here is an option with transpose
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = everything()) %>%
group_by(name) %>%
summarise(min = min(value), max = max(value)) %>%
data.table::transpose(., make.names = 'name')
A B
1 1 2
2 4 2
data
df1 <- structure(list(A = 1:4, B = c(2L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-4L))
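A base R sketch of the same result, assuming the df1 sample data above. The names funs and res_mat are hypothetical; the second form is one possible way to stack an arbitrary list of summary functions as rows:

```r
df1 <- data.frame(A = 1:4, B = c(2L, 2L, 2L, 2L))

# range() returns c(min, max), so each column becomes a 2-row column
res <- as.data.frame(lapply(df1, range))

# more generally: one row per function, one column per variable
funs <- list(min = min, max = max, mean = mean)
res_mat <- t(sapply(funs, function(f) sapply(df1, f)))
```

The second result is a matrix whose row names are the function names, which keeps track of which row came from which function.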

Merge columns within dataframe based on column value R

I currently have a data frame of this structure
ID-No cigsaday activity
1 NA 1
2 NA 1
1 5 NA
2 5 NA
I want to concatenate the rows with the identical ID numbers and create a new data frame that is supposed to look like this
ID-No cigsaday activity
1 5 1
2 5 1
The data frame includes character as well as numeric columns. The rows would be matched on the participant ID in the first column, which occurs 4 times in the dataset.
Any help is appreciated!
A data.table option
> setDT(df)[, lapply(.SD, na.omit), ID_No]
ID_No cigsaday activity
1: 1 5 1
2: 2 5 1
Data
> dput(df)
structure(list(ID_No = c(1L, 2L, 1L, 2L), cigsaday = c(NA, NA,
5L, 5L), activity = c(1L, 1L, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
Many ways lead to Rome. For the sake of completeness, here are some other approaches which return the expected result for the given sample dataset. Your mileage may vary.
1. dplyr, na.omit()
library(dplyr)
df %>%
group_by(ID_No) %>%
summarise(across(everything(), na.omit))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
ID_No cigsaday activity
<int> <int> <int>
1 1 5 1
2 2 5 1
Note, this is a dplyr version of ThomasIsCoding's answer.
2. dplyr, reduce(), coalesce()
library(dplyr)
df %>%
group_by(ID_No) %>%
summarise(across(everything(), ~ purrr::reduce(.x, coalesce)))
3. data.table, fcoalesce()
library(data.table)
setDT(df)[, lapply(.SD, function(x) fcoalesce(as.list(x))), ID_No]
ID_No cigsaday activity
1: 1 5 1
2: 2 5 1
4. data.table, Reduce(), fcoalesce()
library(data.table)
setDT(df)[, lapply(.SD, Reduce, f = fcoalesce), ID_No]
A possible solution using na.locf(), which replaces each NA with the most recent non-NA value.
library(dplyr)
library(zoo)
dat %>%
group_by(IDNo) %>%
mutate_at(vars(-group_cols()),.funs=function(x) na.locf(x)) %>%
distinct(IDNo,cigsaday,activity,.keep_all = TRUE) %>%
ungroup()
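A base R sketch along the same lines, assuming each ID has at most one non-NA value per column as in the sample data. The helper first_non_na is hypothetical:

```r
df <- data.frame(
  ID_No    = c(1L, 2L, 1L, 2L),
  cigsaday = c(NA, NA, 5L, 5L),
  activity = c(1L, 1L, NA, NA)
)

# first non-missing value of a vector (NA if every value is missing)
first_non_na <- function(x) x[!is.na(x)][1]

# collapse all non-ID columns to one row per ID_No
res <- aggregate(df[-1], by = df["ID_No"], FUN = first_non_na)
```

If an ID carried several different non-NA values in one column, this would silently keep only the first, so it is worth checking that assumption on real data.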

How do I merge multiple contingency tables into one using R?

I have multiple columns that I need to merge and return a contingency table counting each number.
Example of an ordinal data set:
df <- data.frame(ab = c(1,2,3,4,5),
ba = c(1,3,3,3,5))
>ab ba
1 1
2 3
3 3
4 3
5 5
I would like to be able to return a contingency table showing like this:
>1 2 3 4 5
2 1 4 1 2
I've attempted examples featured on here for similar issues, but I get sums returned instead of counts:
library(plyr)
colSums(rbind.fill(data.frame(t(unclass(df$ab))), data.frame(t(unclass(df$ba)))),
na.rm = T)
Any help is greatly appreciated
We unlist the data.frame into a vector and apply table in base R
table(unlist(df))
# 1 2 3 4 5
# 2 1 4 1 2
Or with tidyverse, by reshaping the data into 'long' format with pivot_longer and get the count
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
count(value)
data
df <- structure(list(ab = 1:5, ba = c(1L, 3L, 3L, 3L, 5L)),
class = "data.frame", row.names = c(NA,
-5L))
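A small sketch of a variation on the base R approach. It assumes the ordinal scale runs from 1 to 5 (an assumption, taken from the example data); wrapping the values in a factor makes a level with no observations show up as 0 instead of disappearing:

```r
df <- data.frame(ab = c(1, 2, 3, 4, 5),
                 ba = c(1, 3, 3, 3, 5))

# unlist() stacks both columns into one vector of 10 values
values <- unlist(df)

# fixing the levels keeps every scale point in the table
counts <- table(factor(values, levels = 1:5))

# with fewer observed values, the unobserved levels are reported as 0
sparse <- table(factor(c(1, 3, 3), levels = 1:5))
```

For the sample data every level is present, so counts matches the plain table(unlist(df)) result.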

For i in loops in R

I have been really struggling to grasp a basic programming concept - the for loop. I typically deal with hierarchically structured data such that measurements repeat with levels of unique identifiers, like this:
ID Measure
1 2
1 3
1 3
2 4
2 1
...
Very often I need to create a new column that aggregates within ID or produces a value for each row for each level of ID. For the former I use pretty basic functions from either base or dplyr, but for the latter case I'd like to get in the habit of creating for loops.
So for this example, I would like a column added to my hypothetical df such that the new column begins with one for each ID and adds 1 to each subsequent row, until a new ID occurs.
So, this:
ID Measure NewVal
1 2 1
1 3 2
1 3 3
2 4 1
2 1 2
...
I would love to learn how to do this with a for loop, but if there are other ways, I would like to hear those too.
One way is to use the splitstackshape package. There is a function called getanID. This is your friend here. If your df is called mydf, you would do the following. Please note that the result is a data.table. If necessary, you can convert it back to a data.frame.
library(splitstackshape)
getanID(mydf, "ID")
# ID Measure .id
#1: 1 2 1
#2: 1 3 2
#3: 1 3 3
#4: 2 4 1
#5: 2 1 2
DATA
mydf <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Measure = c(2L, 3L,
3L, 4L, 1L)), .Names = c("ID", "Measure"), class = "data.frame", row.names = c(NA,
-5L))
seq_along gives an increasing sequence starting at 1, with the same length as its input. tapply applies a function to each level of a grouping variable. Here we don't care what values are supplied, so you can apply the ID column to itself (the data frame is called d here, and the unlist() step assumes the rows are already sorted by ID):
> d$NewVal <- unlist(tapply(d$ID, d$ID, FUN=seq_along))
> d
ID Measure NewVal
1 1 2 1
2 1 3 2
3 1 3 3
4 2 4 1
5 2 1 2
You could also use data.table to assign the sequence by reference.
# library(data.table)
setDT(mydf) ## convert to data table
mydf[,NewVal := seq(.N), by=ID] ## .N contains number of rows in each ID group
# ID Measure NewVal
# 1: 1 2 1
# 2: 1 3 2
# 3: 1 3 3
# 4: 2 4 1
# 5: 2 1 2
setDF(mydf) ## convert easily to data frame if you wish.
Or you could use ave. The advantage is that it will give the sequence in the same order as that in the original dataset, which may be beneficial in unordered datasets.
transform(df, NewVal=ave(ID, ID, FUN=seq_along))
# ID Measure NewVal
#1 1 2 1
#2 1 3 2
#3 1 3 3
#4 2 4 1
#5 2 1 2
For a more general case (if the ID column is a factor)
transform(df, NewVal=ave(seq_along(ID), ID, FUN=seq_along))
Or if the ID column is ordered
df$NewVal <- sequence(tabulate(df$ID))
Or using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NewVal=row_number())
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Measure = c(2L, 3L,
3L, 4L, 1L)), .Names = c("ID", "Measure"), class = "data.frame",
row.names = c(NA, -5L))
I'd recommend you don't use a for loop for this. It's not a good place for one. You can do this pretty easily in plyr (or dplyr) if you prefer:
require(plyr)
x <- data.frame(cbind(rnorm(100), rnorm(100)))
x$ID <- sample(1:10, 100, replace=T)
new_col <- function(x) {
x <- x[order(x[,1]), ]
x$NewVal <- 1:nrow(x)
return(x)
}
x <- ddply(.data = x, .variables = "ID", .fun = new_col)
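Since the question asks about for loops specifically, here is a minimal sketch of what one could look like, for learning purposes only. It assumes the mydf sample data from above; the names counter, key and prev are hypothetical. A running count is kept per ID, so the rows do not need to be sorted:

```r
mydf <- data.frame(ID      = c(1L, 1L, 1L, 2L, 2L),
                   Measure = c(2L, 3L, 3L, 4L, 1L))

mydf$NewVal <- NA_integer_   # pre-allocate the new column
counter <- integer(0)        # named vector: rows seen so far per ID

for (i in seq_len(nrow(mydf))) {
  key  <- as.character(mydf$ID[i])
  # start from 0 the first time an ID is encountered
  prev <- if (key %in% names(counter)) counter[[key]] else 0L
  counter[[key]] <- prev + 1L
  mydf$NewVal[i] <- counter[[key]]
}
```

The vectorised answers above do the same job in one line and will usually be much faster on large data.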
