Transform count table into disaggregated table of observations - R

I have data in the form of a count table of successes and trials, but for modeling I need these data in a disaggregated trial-level table.
How do I get from this:
df1 <- dplyr::tibble(
  user_id = c(1, 2),
  success = c(3, 4),
  trials = c(9, 10)
)
To this:
dplyr::tibble(
  user_id = c(rep(1, 9), rep(2, 10)),
  success = c(rep(1, 3), rep(0, 6), rep(1, 4), rep(0, 6))
)

We can uncount() based on 'trials', then, grouped by 'user_id', convert 'success' to binary by creating a logical condition with row_number().
library(dplyr)
library(tidyr)
df1 %>%
  uncount(trials) %>%                                      # one row per trial
  group_by(user_id) %>%
  mutate(success = +(row_number() <= first(success))) %>%  # 1 for the first 'success' rows, 0 after
  ungroup()
# A tibble: 19 x 2
# user_id success
# <dbl> <int>
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 0
# 5 1 0
# 6 1 0
# 7 1 0
# 8 1 0
# 9 1 0
#10 2 1
#11 2 1
#12 2 1
#13 2 1
#14 2 0
#15 2 0
#16 2 0
#17 2 0
#18 2 0
#19 2 0
Or with base R, using Map and stack:
stack(setNames(Map(function(x, y) rep(1:0, c(x, y)),
                   df1$success, df1$trials - df1$success),
               df1$user_id))[2:1]
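For completeness, a dependency-free sketch that builds both columns in one pass with rep(); the column names here are my choice, to match the target tibble:

# Repeat each user_id once per trial, then expand each user's
# successes (1s) followed by failures (0s)
with(df1, data.frame(
  user_id = rep(user_id, trials),
  success = unlist(Map(function(s, t) rep(1:0, c(s, t - s)), success, trials))
))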

Efficient way to add sample information as new column to data set

I know how I can subset a data frame by sampling certain rows. However, I'm struggling to find an easy (preferably tidyverse) way to simply ADD the sampling information as a new column to my data set, i.e. I want to populate a new column with 1 if the row is sampled and 0 if not.
I currently have the solution below, but it feels overly complicated. Note that in the example I want to sample 3 rows per group.
df <- data.frame(group = c(1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1),
                 var = 1:15)
library(tidyverse)
df <- df %>%
  group_by(group) %>%
  mutate(sampling_info = sample.int(n(), size = n(), replace = FALSE),
         sampling_info = if_else(sampling_info <= 3, 1, 0))
You can try:
library(dplyr)
set.seed(123)

df %>%
  arrange(group) %>%
  group_by(group) %>%
  mutate(sampling_info = as.integer(row_number() %in% sample(n(), size = 3))) %>%
  ungroup()
# group var sampling_info
# <dbl> <int> <int>
# 1 1 1 0
# 2 1 3 0
# 3 1 5 1
# 4 1 6 0
# 5 1 7 0
# 6 1 8 0
# 7 1 14 1
# 8 1 15 1
# 9 2 2 0
#10 2 4 1
#11 2 9 1
#12 2 10 0
#13 2 11 0
#14 2 12 1
#15 2 13 0
sample(n(), size = 3) generates 3 random row numbers within each group, and we assign 1 to those rows (this assumes every group has at least 3 rows).
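If you prefer to keep the original row order without arrange(), here is a base R sketch of the same idea with ave() (the seed is only for reproducibility):

set.seed(123)
# Within each group, mark 3 randomly chosen positions with 1, the rest with 0
df$sampling_info <- ave(seq_along(df$group), df$group,
                        FUN = function(i) as.integer(seq_along(i) %in% sample(length(i), 3)))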

Calculate sum of n previous rows

I have quite a big dataframe and I'm trying to add a new variable which is the sum of the three previous rows on a running basis, grouped by ID. The first three rows per ID should be 0. Here's what it should look like:
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any package that has a rolling-calculation function: use a window size of 3 and lag the result. For example, with zoo::rollsumr:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
  ungroup()
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
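The same lag-the-rolling-sum idea works with other backends; for example, a sketch with data.table's frollsum and shift (assuming data.table is installed):

library(data.table)
# Right-aligned rolling sum of width 3, lagged by one row, per ID
setDT(df)[, VarNew := shift(frollsum(Var1, 3, fill = 0), fill = 0), by = ID]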
You can use stats::filter in ave (the namespace matters here, since dplyr is loaded and masks filter).
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x) c(0, 0, 0,
                 stats::filter(head(x, -1), c(1, 1, 1), sides = 1)[-1:-2]))
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum in combination with head and tail (the cumulative sum must exclude the current row, so lag first):
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x) {
  y <- c(0, cumsum(head(x, -1)))  # cumulative sum of the previous rows
  c(0, 0, 0, tail(y, -3) - head(y, -3))
})
The runner package also helps (with several IDs, add a group_by(ID) step first):
library(runner)
df %>% mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
The NAs can easily be recoded to 0 if desired, for example as sketched below.
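A minimal sketch of that recoding with dplyr::coalesce (tidyr::replace_na would work equally well):

df %>%
  mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1),
         var_new = coalesce(var_new, 0))  # replace padding NAs with 0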

In R, take sum of multiple variables if combination of values in two other columns are unique

I am trying to expand on the answer to this problem that was solved: Take Sum of a Variable if Combination of Values in Two Other Columns are Unique.
Because I am new to Stack Overflow, I can't comment directly on that post, so here is my problem:
I have a dataset like the following, but with about 100 columns of binary data like the "ani1" and "bni2" columns shown.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum the columns by location and season, so that for each unique combination of location and season I get a total for column 3 onward.
The problem is that not every column has a 1 value for every combination of location and season, and the columns all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for (i in 3:length(df)) {
  testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
  df2 <- aggregate(i ~ ., testdf, FUN = sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise with across after group_by. Since the count columns do not share a common prefix (ani1, bni2), select every non-grouping column:
library(dplyr)

df %>%
  group_by(Locations, seasons) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
  ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids having to select columns 3 onward by name.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -c(Locations, seasons)) %>%
  group_by(Locations, seasons, name) %>%
  summarise(Sum = sum(value, na.rm = TRUE)) %>%
  ungroup() %>%
  pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
  Locations seasons  ani1  bni2
  <chr>     <chr>   <dbl> <dbl>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
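For reference, the for loop in the question can also be replaced by a single base R aggregate() call; a sketch of the same grouped sum, where the dot expands to all remaining columns:

# Sum every non-grouping column for each Locations/seasons combination
aggregate(. ~ Locations + seasons, data = df, FUN = sum)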

How can I filter by subjects who have all levels of a factor?

I am trying to filter a data set to only include subjects who have data in all conditions (levels of a factor).
I have tried to filter by calculating the number of levels for each subject, but that does not work.
library(tidyverse)
Data <- data.frame(
  Subject = factor(c(rep(1, 3),
                     rep(2, 3),
                     rep(3, 1))),
  Condition = factor(c("A", "B", "C",
                       "A", "B", "C",
                       "A")),
  Val = c(1, 0, 1,
          0, 0, 1,
          1)
)
Data %>%
  semi_join(
    .,
    Data %>%
      group_by(Subject) %>%
      summarize(Num_Cond = length(levels(Condition))) %>%
      filter(Num_Cond == 3),
    by = "Subject"
  )
This attempt yields:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
7 3 A 1
Desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
I want to filter subject 3 out because they only have data for one condition.
Is there a dplyr/tidyverse approach for this problem?
We can create a condition with all() and levels():
library(dplyr)
Data %>%
  group_by(Subject) %>%
  filter(all(levels(Condition) %in% Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Or with n_distinct() and nlevels():
Data %>%
  group_by(Subject) %>%
  filter(nlevels(Condition) == n_distinct(Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Here is a solution testing whether the number of rows of each group is equal to the number of levels of Condition.
Data %>%
  group_by(Subject) %>%
  filter(n() == nlevels(Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Edit
Following a comment by user @akrun, I tested with a data set having duplicate rows, and the code above does fail:
bind_rows(Data, Data) %>%
  group_by(Subject) %>%
  # distinct() %>%
  filter(n() == nlevels(Condition))
# A tibble: 0 x 3
# Groups: Subject [0]
# ... with 3 variables: Subject <fct>, Condition <fct>, Val <dbl>
Uncommenting the distinct() line solves the problem.
I found a relatively simple solution by subsetting on Subject:
Data %>%
  semi_join(
    .,
    Data %>%
      group_by(Subject) %>%
      droplevels() %>%
      summarize(Num_Cond = length(levels(Condition)[Subject])) %>%
      filter(Num_Cond == 3),
    by = "Subject"
  )
This gives the desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
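For readers not using dplyr, a base R sketch of the same filter, mirroring the all(levels(...) %in% ...) condition above:

# TRUE for each subject that has every level of Condition
keep <- tapply(Data$Condition, Data$Subject,
               function(x) all(levels(Data$Condition) %in% x))
Data[Data$Subject %in% names(keep)[keep], ]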

Filter Rows Between with Multiple Events per Subject

I have a large data set and I'm trying to filter to the days following a specific event for each subject. The issue is that the event of interest may happen multiple times for some subjects, and for a few subjects the event doesn't happen at all (in which case they can simply be removed from the summarized data).
Here is an example of the data and what I've tried:
library(tidyverse)
set.seed(355)
subject <- c(rep(LETTERS[1:4], each = 40), rep("E", times = 40))
event <- c(sample(0:1, size = length(subject) - 40, replace = TRUE, prob = c(0.95, 0.05)),
           rep(0, times = 40))
df <- data.frame(subject, event)
df %>%
  filter(event == 1) %>%
  count(subject, event, sort = TRUE)
# A tibble: 4 x 3
subject event n
<fct> <dbl> <int>
1 D 1 3
2 A 1 2
3 B 1 2
4 C 1 2
So we see that subject D has had the event 3 times while subjects A, B, and C have had the event 2 times. Subject E has not had the event at all.
My next step was to create an event tag that identifies where each event happened and produces an NA for all other rows. I also created an event sequence, which counts along the rows between events, because I thought it might be useful, but I didn't end up using it.
df_cleaned <- df %>%
  group_by(subject, event) %>%
  mutate(event_seq = seq_along(event == 1),
         event_detail = ifelse(event == 1, "event", NA)) %>%
  as.data.frame()
I tried two different approaches using filter() and between() to get each event and the two rows following it. Both approaches throw an error because of the multiple events within subjects, and I can't figure out a good workaround.
Approach 1:
df_cleaned %>%
  group_by(subject) %>%
  filter(., between(row_number(),
                    left = which(!is.na(event_detail)),
                    right = which(!is.na(event_detail)) + 1))
Approach 2:
df_cleaned %>%
  group_by(subject) %>%
  mutate(event_group = cumsum(!is.na(event_detail))) %>%
  filter(., between(row_number(),
                    left = which(event_detail == "event"),
                    right = which(event_detail == "event") + 2))
If you want to get rows with 1 in event and the following two rows, you can do the following. Thanks to Ananda Mahto, the author of the splitstackshape package, we can handle this type of operation with getMyRows() from his SOfun package, which returns a list. You specify a range of rows in the function; here I said 0:2, asking R to take each row with 1 in event plus the following two rows. I used bind_rows() to return a data frame, but if you need to work with a list, you can skip that step.
remotes::install_github("mrdwab/SOfun")
library(SOfun)
library(dplyr)

ind <- which(df$event == 1)
bind_rows(getMyRows(data = df, pattern = ind, range = 0:2))
subject event
1 A 1
2 A 0
3 A 0
4 A 1
5 A 0
6 A 0
7 B 1
8 B 0
9 B 0
10 B 1
11 B 0
12 B 0
13 C 1
14 C 0
15 C 0
16 C 1
17 C 0
18 C 0
19 D 1
20 D 0
21 D 0
22 D 1
23 D 0
24 D 0
25 D 1
26 D 0
27 D 0
Here is a tidyverse approach which uses cumsum() to create groups of rows after (and including) an event and which picks the top 3 rows of each group:
df %>%
  group_by(subject) %>%
  mutate(event_group = cumsum(event == 1L)) %>%
  group_by(event_group, add = TRUE) %>%
  filter(event_group > 0 & row_number() <= 3L)
# A tibble: 27 x 3
# Groups: subject, event_group [9]
subject event event_group
<fct> <dbl> <int>
1 A 1 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 0 2
6 A 0 2
7 B 1 1
8 B 0 1
9 B 0 1
10 B 1 2
# … with 17 more rows
For testing an edge case, here is a modified data set where subject A starts with three consecutive events. Furthermore, I have added row numbers rn in order to check that the correct rows are picked:
df2 <- df %>%
  mutate(event = ifelse(row_number() <= 2L, 1L, event),
         rn = row_number())
Now we get
df2 %>%
  group_by(subject) %>%
  mutate(event_group = cumsum(event == 1L)) %>%
  group_by(event_group, add = TRUE) %>%
  filter(event_group > 0 & row_number() <= 3L)
# A tibble: 29 x 4
# Groups: subject, event_group [11]
subject event rn event_group
<fct> <dbl> <int> <int>
1 A 1 1 1
2 A 1 2 2
3 A 1 3 3
4 A 0 4 3
5 A 0 5 3
6 A 1 22 4
7 A 0 23 4
8 A 0 24 4
9 B 1 59 1
10 B 0 60 1
# … with 19 more rows
which is in line with my expectations for this edge case.
Here is a base R option which looks similar to @jazzurro's approach. We get the row indices where event == 1, then take each index plus the following two rows, use unique so that overlapping indices are selected only once, and subset the original df.
inds <- which(df$event == 1)
df[unique(c(sapply(inds, `+`, 0:2))), ]
# subject event
#3 A 1
#4 A 0
#5 A 0
#22 A 1
#23 A 0
#24 A 0
#59 B 1
#60 B 0
#61 B 0
#62 B 1
#63 B 0
#64 B 0
#....
Another option using dplyr could be lag():
library(dplyr)
df %>%
  group_by(subject) %>%
  filter(event == 1 | lag(event) == 1 | lag(event, 2) == 1)
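Note that lag() returns NA for the first rows of each group and filter() drops NA comparisons, which happens to be the desired behavior here. A variant with explicit defaults makes that intent visible (a minor sketch, not part of the original answer):

df %>%
  group_by(subject) %>%
  # default = 0 so the first rows of each group compare against 0 instead of NA
  filter(event == 1 | lag(event, default = 0) == 1 | lag(event, 2, default = 0) == 1)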
