Select values based on conditions - r

I have the following data set:
id pnum t1 t2 t3 w1 w2 w3
1 1 w r r 1 1 1
1 2 o o w 0 0 1
1 3 o w w 1 1 1
2 1 o w t 1 0 1
2 2 s s s 1 0 1
2 3 s s s 1 0 1
Id defines the group membership.
Based on id and pnum I would like to identify the common measurements reported with pnum 3 at time t with respect to w.
In other words, id and pnum identify different individuals who took some measurements. In some cases the measurement was taken together with others; in other cases it was taken alone. If it was taken together, then 'w' has the value 1.
For example:
Common activities at time t:
Id 1 pnum 1 at t1 reported (e.g. 1) that a measurement was taken with someone from the group, more specifically with id1/pnum3. If the measurement taken together is common in both groups, I would like to save it.
Uncommon activities at time t:
Id 2 pnum 1 at t1 reported (e.g. 1) that a measurement was taken with someone from the group, more specifically with id2/pnum2 and pnum 3. In this case the measurement taken together is uncommon between id2/pnum1 and pnum 2, as well as between id2/pnum1 and pnum 3. I don't want to save these measurements. But I would like to save the common one reported between id2/pnum2 and pnum 3.
Generic example id 1:
In the group with id 1, pnum 1 and pnum 3 took a measurement together at t1. Pnum 1 reported w, while pnum 2 and pnum 3 reported o. This means that pnum 2 and pnum 3 reported the same measurement. However, when I look at w1, I can see that they were not together when they took it: at w1, pnum 2 is 0 and pnum 3 is 1. In other words, even though the measurement is common to pnum 2 and pnum 3, it was not taken together, so I don't want to save the case. I need to report whether they reported the same measurement or not. In this case pnum 1 reported w while pnum 3 reported o, so the measurements don't match. Therefore I coded 0. I don't want to save the case.
I would like to identify the common measurements that were taken together at time t.
Output:
id pnum t1 t2 t3
1 1 0 0 0
1 2 0 0 w
1 3 0 s w
2 1 0 0 0
2 2 s 0 s
2 3 s 0 s
Sample data:
df<-structure(list(id=c(1,1,1,2,2,2),pnum=c(1,2,3,1,2,3), t1=c("w","o","o","o","s","s"), t2=c("r","o","w","w","s","s"),t3 = c("r","w","w","t","s","s"), w1= c(1,0,1,1,1,1), w2 = c(1,0,1,0,0,0), w3 = c(1,1,1,1,1,1)), row.names = c(NA, 6L), class = "data.frame")

I don't get why in your expected output there is an "s" at [id1, pnum3, t2] - other than that, I think the following might help you:
First of all, pivoting your data to a "longer" format, where you can group by time, helps you to generalize your code.
library(dplyr)
library(tidyr)
df_longer <- df %>%
pivot_longer(
cols = matches("^[tw]\\d+$"),
names_to = c(".value","time"),
names_pattern = "([tw])(\\d+)"
)
The above pivots your data to look like this:
> head(df_longer)
# A tibble: 6 x 5
id pnum time t w
<dbl> <dbl> <chr> <chr> <dbl>
1 1 1 1 w 1
2 1 1 2 r 1
3 1 1 3 r 1
4 1 2 1 o 0
5 1 2 2 o 0
6 1 2 3 w 1
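To see what the `.value` sentinel in `names_to` does in isolation, here is a minimal sketch on a one-row toy data frame. The `toy` data frame is made up for illustration and is not part of the question's data; the `pivot_longer()` call is the same as above.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-row example with two time points
toy <- data.frame(id = 1, t1 = "a", t2 = "b", w1 = 1, w2 = 0,
                  stringsAsFactors = FALSE)

toy_long <- toy %>%
  pivot_longer(
    cols = matches("^[tw]\\d+$"),
    # ".value": the first regex group (t or w) becomes an output column name;
    # the second group (the digits) is stored in a new "time" column
    names_to = c(".value", "time"),
    names_pattern = "([tw])(\\d+)"
  )

toy_long
```

Each original row becomes one row per time point, with `t` and `w` kept side by side, which is what makes grouping by `time` possible afterwards.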
Now, you can easily group it up and identify those individuals that have given common answers at any given time:
common_answers <- df_longer %>%
arrange(id, time, pnum) %>%
filter(w == 1) %>% # throw out if the answer was given individually
select(-w) %>% # w not needed anymore
group_by(id, time, t) %>% # group by selected answer
filter(n() > 1) %>% # keep only answers given >1 times
ungroup()
This presents you with only a filtered set of your data, where answers were given in common within the group:
> common_answers
# A tibble: 6 x 4
id pnum time t
<dbl> <dbl> <chr> <chr>
1 1 2 3 w
2 1 3 3 w
3 2 2 1 s
4 2 3 1 s
5 2 2 3 s
6 2 3 3 s
// ADDITION:
In case you have to rely on the "wide" format in your output, you can obviously retain all data, modify t so that it only retains its value when it is given commonly by >1 subject and then widen your df again:
common_answers_wide <- df_longer %>%
group_by(id, time, w, t) %>%
mutate(
# retain t only when the response has been given by >1 subject
t = case_when(
w == 0 ~ "0",
n() > 1 ~ t,
TRUE ~ "0"
)
) %>%
ungroup() %>%
select(-w) %>%
pivot_wider(
names_from = time, names_prefix = "t", names_sort = TRUE,
values_from = t
)
That gives you the desired output (apart from the "s" at [id1, pnum3, t2] mentioned above):
> common_answers_wide
# A tibble: 6 x 5
id pnum t1 t2 t3
<dbl> <dbl> <chr> <chr> <chr>
1 1 1 0 0 0
2 1 2 0 0 w
3 1 3 0 0 w
4 2 1 0 0 0
5 2 2 s 0 s
6 2 3 s 0 s

Related

De-aggregate a data frame

There have been many similar questions (e.g. Repeat each row of data.frame the number of times specified in a column, De-aggregate / reverse-summarise / expand a dataset in R, Repeating rows of data.frame in dplyr), but my data set is of a different structure than the answers to these questions assume.
I have a data frame with the frequencies of measurements within each group and the total number of observations for each outcome per group total_N:
tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3))
# A tibble: 2 x 4
group total_N outcome_A outcome_B
<chr> <dbl> <dbl> <dbl>
1 A 4 1 2
2 B 5 4 3
I want to de-aggregate the data, so that the data frame has as many rows as total observations and each outcome has a 1 for all observations with the outcome and a 0 for all observations without the outcome. Thus the final result should be a data frame like this:
# A tibble: 9 x 3
group outcome_A outcome_B
<chr> <dbl> <dbl>
1 A 1 1
2 A 0 1
3 A 0 0
4 A 0 0
5 B 1 1
6 B 1 1
7 B 1 1
8 B 1 0
9 B 0 0
As the aggregated data does not contain any information about the frequency of combinations (i.e., the correlation) of outcome_A and outcome_B, this can be ignored.
Here's a tidyverse solution.
As you say, it's easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it's easy to convert grouped counts to presence/absence flags. across gives you a way to succinctly convert multiple count columns.
library(tidyverse)
tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3)) %>%
uncount(total_N) %>%
group_by(group) %>%
mutate(
across(
starts_with("measure"),
function(x) as.numeric(row_number() <= x)
)
) %>%
ungroup()
# A tibble: 9 × 3
group measure_A measure_B
<chr> <dbl> <dbl>
1 A 1 1
2 A 0 1
3 A 0 0
4 A 0 0
5 B 1 1
6 B 1 1
7 B 1 1
8 B 1 0
9 B 0 0
As you say, this approach takes no account of correlations between the outcome columns, as this cannot be deduced from the grouped data.
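One way to sanity-check the expansion is to aggregate it again and compare with the input. The sketch below repeats the pipeline above (loading only dplyr and tidyr) and then re-aggregates; the names `agg`, `expanded` and `reaggregated` are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

agg <- tibble(group = c("A", "B"), total_N = c(4, 5),
              measure_A = c(1, 4), measure_B = c(2, 3))

# De-aggregate: one row per observation, 1/0 flags per measure
expanded <- agg %>%
  uncount(total_N) %>%
  group_by(group) %>%
  mutate(
    across(
      starts_with("measure"),
      function(x) as.numeric(row_number() <= x)
    )
  ) %>%
  ungroup()

# Re-aggregate: row counts and flag sums should match the original counts
reaggregated <- expanded %>%
  group_by(group) %>%
  summarise(total_N = n(), across(starts_with("measure"), sum))

reaggregated
```

If the round trip reproduces `agg`, the marginal counts were preserved; it says nothing about the joint distribution of the two measures, which remains arbitrary.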

How to create binary variable for each individual based on value in other variable?

So I have a data set consisting of 4 individuals. Each individual is measured over a different time period. In R:
df = data.frame(cbind("id"=c(1,1,1,2,2,3,3,3,3,4,4), "t"=c(1,2,3,1,2,1,2,3,4,1,2), "x1"=c(0,1,0,1,0,0,1,0,1,0,0)))
and I want to create a variable x2 indicating whether there has already been a 1 in variable x1 for the given individual, i.e. it should look like this:
"x2" = c(0,1,1,1,1,0,1,1,1,0,0)
... ideally with the dplyr package. So far I have come up with this:
new_df = df %>% dplyr::group_by(id) %>% dplyr::arrange(t)
but cannot move on from this point... The desired result is shown in the picture.
Here is one approach using dplyr:
df %>%
arrange(id, t) %>%
group_by(id) %>%
mutate(x2 = ifelse(row_number() >= min(row_number()[x1 == 1]), 1, 0))
This will add a 1 if the row number is greater or equal to the first row number where x1 is 1; otherwise, it will add a 0.
Note that you will get warnings, as at least one group has no x1 value equal to 1.
Here is another alternative, which gives NA for any id that has no x1 value of 1 (e.g., id 4):
df %>%
arrange(id, t) %>%
group_by(id) %>%
mutate(x2 = +(row_number() >= which(x1 == 1)[1]))
Output (of the first approach)
id t x1 x2
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 2 1 1
3 1 3 0 1
4 2 1 1 1
5 2 2 0 1
6 3 1 0 0
7 3 2 1 1
8 3 3 0 1
9 3 4 1 1
10 4 1 0 0
11 4 2 0 0
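A different technique from the two above, assuming x1 only ever contains 0 and 1: a grouped cumulative maximum stays at 0 until the first 1 and at 1 afterwards, so it yields the same flag without warnings or NA. A minimal sketch (`res` is a name introduced here):

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
                 t  = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2),
                 x1 = c(0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0))

res <- df %>%
  arrange(id, t) %>%
  group_by(id) %>%
  # running maximum within each id: switches to 1 at the first x1 == 1
  mutate(x2 = cummax(x1)) %>%
  ungroup()

res
```

For id 4, which never has a 1, x2 is simply 0 throughout.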

R: Slicing a grouped data frame conditional on a column

I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = c(rep(c("A", "B", "C"), each = 3)),
condition = rep(c(0,1,1), each = 3),
index = c(1:3,1:3,2:4))
> df
group condition index
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 B 1 2
6 B 1 3
7 C 1 2
8 C 1 3
9 C 1 4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
group_by(group) %>%
mutate(rank = order(index)) %>%
filter(case_when(condition == 0 ~ TRUE,
condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups: group [3]
group condition index rank
<chr> <dbl> <int> <int>
1 A 0 1 1
2 A 0 2 2
3 A 0 3 3
4 B 1 1 1
5 C 1 2 1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.
You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
group_by(group) %>%
filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2
An option with slice
library(dplyr)
df %>%
group_by(group) %>%
slice(unique(c(which(condition == 0), which.min(index))))
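Since the question mentions slice_min(), here is one possible sketch that uses it. It assumes it is acceptable to split the data by condition and bind the parts back together; `kept` is a name introduced for illustration.

```r
library(dplyr)

df <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                 condition = rep(c(0, 1, 1), each = 3),
                 index = c(1:3, 1:3, 2:4),
                 stringsAsFactors = FALSE)

kept <- bind_rows(
  # rows where the condition does not apply are kept unchanged
  filter(df, condition == 0),
  # rows where it applies: keep the lowest index per group (ties are kept)
  df %>%
    filter(condition == 1) %>%
    group_by(group) %>%
    slice_min(index, n = 1) %>%
    ungroup()
) %>%
  arrange(group, index)

kept
```

This is more verbose than the filter() version, so it is mainly useful if you want the tie handling that slice_min() offers through its with_ties argument.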

count total and positive samples by group

I have a dataframe like this:
df <- data.frame(concentration=c(0,0,0,0,2,2,2,2,4,4,6,6,6),
result=c(0,0,0,0,0,0,1,0,1,0,1,1,1))
I want to count the total number of results for each concentration level.
I want to count the number of positive samples for each concentration level.
And I want to create a new dataframe with concentration level, total results, and number positives.
conc pos_c total_c
0 0 4
2 1 4
4 1 2
6 3 3
This is what I've come up with so far using plyr:
c <- count(df, "concentration")
r <- count(df, "concentration","result")
names(c)[which(names(c) == "freq")] <- "total_c"
names(r)[which(names(r) == "freq")] <- "pos_c"
cbind(c,r)
concentration total_c concentration pos_c
1 0 4 0 0
2 2 4 2 1
3 4 2 4 1
4 6 3 6 3
This repeats the concentration column. I think there is probably a much better/easier way to do this that I'm missing, maybe with another library. I'm not sure how to do this in R, which is relatively new to me. Thanks.
We need a grouped sum. Using the tidyverse, we group by concentration (group_by), then summarise to get the two columns: 1) the sum of the logical expression (result > 0), and 2) the number of rows (n()).
library(dplyr)
df %>%
group_by(conc = concentration) %>%
summarise(pos_c = sum(result > 0), # in the example just sum(result)
total_c = n())
# A tibble: 4 x 3
# conc pos_c total_c
# <dbl> <int> <int>
#1 0 0 4
#2 2 1 4
#3 4 1 2
#4 6 3 3
Or using base R with table and addmargins
addmargins(table(df), 2)[,-1]
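The base R one-liner is compact, so here is a step-by-step sketch of what it does. The intermediate names (`tab`, `with_total`, `out`, `res`) and the final conversion to a data frame with the requested column names are additions for illustration.

```r
df <- data.frame(concentration = c(0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 6, 6, 6),
                 result = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1))

# Cross-tabulate: rows are concentration levels, columns are result values 0 and 1
tab <- table(df)

# Add a "Sum" column (margin 2 = columns) holding the row totals
with_total <- addmargins(tab, 2)

# Drop the first column (result == 0), leaving positives and totals
out <- with_total[, -1]

# Optional: reshape into a data frame with the column names from the question
res <- data.frame(conc = as.numeric(rownames(out)),
                  pos_c = as.vector(out[, "1"]),
                  total_c = as.vector(out[, "Sum"]))
res
```

Note that this relies on result being coded as 0/1, so that the column named "1" holds the positive counts.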

How do I sum recurring values according to a level in a column and output a table of counts?

I'm new to R and I have data that looks something like this:
categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark","shark","tiger","tiger","whale","whale","worm")
dat <- cbind(categories,animals)
Some animals repeat according to the category. For example, "cat" appears in all three categories A, B, and C.
I would like my new dataframe output to look something like this:
A B C count
1 1 1 1
1 1 0 2
1 0 1 0
0 1 1 2
1 0 0 2
0 1 0 0
0 0 1 2
0 0 0 0
The number 1 under A, B, and C means that the animal appears in that category, 0 means the animal does not appear in that category. For example, the first line has 1s in all three categories. The count is 1 for the first line because "cat" is the only animal that repeats itself in each category.
Is there a function in R that will help me achieve this? Thank you in advance.
We can use table to create a cross-tabulation of categories and animals, transpose, convert to data.frame, group_by all categories and count the frequency per combination:
library(dplyr)
library(tidyr)
as.data.frame.matrix(t(table(dat))) %>%
group_by_all() %>%
summarize(Count = n())
Result:
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<int> <int> <int> <int>
1 0 0 1 2
2 0 1 1 2
3 1 0 0 2
4 1 1 0 2
5 1 1 1 1
Edit (thanks to @C. Braun). Here is how to also include the zero A, B, C combinations:
as.data.frame.matrix(t(table(dat))) %>%
bind_rows(expand.grid(A = c(0,1), B = c(0,1), C = c(0,1))) %>%
group_by_all() %>%
summarize(Count = n()-1)
or with complete, as suggested by @Ryan:
as.data.frame.matrix(t(table(dat))) %>%
mutate(non_missing = 1) %>%
complete(A, B, C) %>%
group_by(A, B, C) %>%
summarize(Count = sum(ifelse(is.na(non_missing), 0, 1)))
Result:
# A tibble: 8 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0 0 1 2
3 0 1 0 0
4 0 1 1 2
5 1 0 0 2
6 1 0 1 0
7 1 1 0 2
8 1 1 1 1
We have
xxtabs <- function(df, formula) {
xt <- xtabs(formula, df)
xxt <- xtabs( ~ . , as.data.frame.matrix(xt))
as.data.frame(xxt)
}
and
> xxtabs(dat, ~ animals + categories)
A B C Freq
1 0 0 0 0
2 1 0 0 2
3 0 1 0 0
4 1 1 0 2
5 0 0 1 2
6 1 0 1 0
7 0 1 1 2
8 1 1 1 1
(dat should really be constructed as data.frame(animals, categories)). This base approach uses xtabs() to form the first cross-tabulation
xt <- xtabs(~ animals + categories, dat)
then coerces using as.data.frame.matrix() to a second data.frame, and uses a second cross-tabulation of all columns of the computed data.frame
xxt <- xtabs(~ ., as.data.frame.matrix(xt))
coerced to the desired form
as.data.frame(xxt)
I originally said this approach was 'arcane', because it relies on knowledge of the difference between as.data.frame() and as.data.frame.matrix(); I think of xtabs() as a tool that users of base R should know. I see though that the other solutions also require this arcane knowledge, as well as knowledge of more obscure (e.g., complete(), group_by_all(), funs()) parts of the tidyverse. Also, the other answers are not (or at least not written in a way that allows) easily generalizable; xxtabs() does not actually know anything about the structure of the incoming data.frame, whereas implicit knowledge of the incoming data are present throughout the other answers.
One 'lesson learned' from the tidy approach is to place the data argument first, allowing piping
dat %>% xxtabs(~ animals + categories)
If I understood you correctly, this should do the trick.
require(tidyverse)
dat %>%
mutate(value = 1) %>%
spread(categories, value) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>%
mutate(count = rowSums(data.frame(A, B, C), na.rm = TRUE)) %>%
group_by(A, B, C) %>%
summarize(Count = n())
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <int>
1 0. 0. 1. 2
2 0. 1. 1. 2
3 1. 0. 0. 2
4 1. 1. 0. 2
5 1. 1. 1. 1
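The same idea can be sketched with newer tidyr verbs, swapping pivot_wider() in for spread() and count() in for the group_by()/summarize() pair. This is a variant, not the code above; it assumes dat is built as a data frame rather than with cbind(), and `combos` is a name introduced here. The complete() step adds the combinations that never occur.

```r
library(dplyr)
library(tidyr)

categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark",
             "shark","tiger","tiger","whale","whale","worm")
# Assumption: a data frame instead of the matrix produced by cbind()
dat <- data.frame(categories, animals, stringsAsFactors = FALSE)

combos <- dat %>%
  mutate(value = 1L) %>%
  # one row per animal, one 0/1 column per category
  pivot_wider(names_from = categories, values_from = value, values_fill = 0L) %>%
  # number of animals per A/B/C pattern
  count(A, B, C, name = "Count") %>%
  # add the patterns that do not occur, with a count of 0
  complete(A = 0:1, B = 0:1, C = 0:1, fill = list(Count = 0L))

combos
```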
Adding a data.table solution. First, pivot animals against categories in dat using dcast. Then create all combinations of A, B, C using CJ. Join those combinations with the pivoted table and count the number of occurrences of each combination.
library(data.table)
dcast(as.data.table(dat), animals ~ categories, length)[
CJ(A=0:1, B=0:1, C=0:1), .(count=.N), on=c("A","B","C"), by=.EACHI]
