Counting an indicator with respect to two groups in R

I have groups, persons within each group, and an indicator. How can I count, for each group, the number of distinct persons whose indicator is 1, and attach that count to every row of the group?
group person ind
1 1 1
1 1 1
1 2 1
2 1 0
2 2 1
2 2 1
Desired output: in the first group two persons have ind equal to 1, and in the second group only one person does, so:
group person ind count
1 1 1 2
1 1 1 2
1 2 1 2
2 1 0 1
2 2 1 1
2 2 1 1

Could do:
library(dplyr)
df %>%
group_by(group) %>%
mutate(
count = n_distinct(person[ind == 1])
)
Output:
# A tibble: 6 x 4
# Groups: group [2]
group person ind count
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 1 2
3 1 2 1 2
4 2 1 0 1
5 2 2 1 1
6 2 2 1 1
Or in data.table:
library(data.table)
setDT(df)[, count := uniqueN(person[ind == 1]), by = group]

An option using base R
df1$count <- with(df1, ave(ind* person, group, FUN =
function(x) length(unique(x[x!=0]))))
df1$count
#[1] 2 2 2 1 1 1
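The answers above can be tried on a data frame like the one in the question. The following is a minimal, self-contained sketch in base R. The construction of df is an assumption reconstructed from the question's table, and the variant shown indexes rows instead of multiplying ind by person, so it should also work if a person ID happened to be 0.

```r
# Example data, reconstructed from the question (an assumption)
df <- data.frame(
  group  = c(1, 1, 1, 2, 2, 2),
  person = c(1, 1, 2, 1, 2, 2),
  ind    = c(1, 1, 1, 0, 1, 1)
)

# For the rows i of each group, count the distinct persons with ind == 1.
# ave() recycles the single count to every row of the group.
df$count <- ave(seq_len(nrow(df)), df$group, FUN = function(i) {
  length(unique(df$person[i][df$ind[i] == 1]))
})

df$count
```

Passing row indices to ave() is only one way to reach several columns inside FUN; the dplyr and data.table versions above express the same idea more directly.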

Related

In R: Subset observations that have values, 0, 1, and 2 by group

I have the following data:
companyID status
1 1
1 1
1 0
1 2
2 1
2 1
2 1
3 1
3 0
3 2
3 2
3 2
I would like to keep only those groups (by companyID) in which status takes all of the values 0, 1, and 2. My preferred outcome would look like the following:
companyID status
1 1
1 1
1 0
1 2
3 1
3 0
3 2
3 2
3 2
Thank you in advance for any help!!
You can select groups where all the values from 0-2 are present in the group.
library(dplyr)
df %>%
group_by(companyID) %>%
filter(all(0:2 %in% status))
# companyID status
# <int> <int>
#1 1 1
#2 1 1
#3 1 0
#4 1 2
#5 3 1
#6 3 0
#7 3 2
#8 3 2
#9 3 2
In base R and data.table:
# Base R:
subset(df, as.logical(ave(status, companyID, FUN = function(x) all(0:2 %in% x))))
# data.table
library(data.table)
setDT(df)[, .SD[all(0:2 %in% status)], companyID]
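We can also count how many of the values 0, 1, and 2 occur in each group and keep the groups where all three are found: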
library(dplyr)
df %>%
group_by(companyID) %>%
filter(sum(0:2 %in% status) == 3)
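For reference, here is a small base R sketch of the same filter that builds a per-group lookup first. The data frame is reconstructed from the question, and the names has_all, keep, and result are illustrative only.

```r
# Example data, reconstructed from the question (an assumption)
df <- data.frame(
  companyID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  status    = c(1, 1, 0, 2, 1, 1, 1, 1, 0, 2, 2, 2)
)

# One TRUE/FALSE per company: does status contain all of 0, 1 and 2?
has_all <- tapply(df$status, df$companyID, function(x) all(0:2 %in% x))

# Look the flag up for every row (by company name) and subset
keep <- as.vector(has_all[as.character(df$companyID)])
result <- df[keep, ]

result
```

The intermediate has_all vector can be handy when the list of qualifying companies is needed on its own.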

Response change analysis in R

I am trying to explore the response change patterns for particular questions. Here is an example of dataset.
id <- c(1,1,1, 2,2,2, 3,3,3,3, 4,4)
item.id <- c(1,1,1, 1,1,1 ,1,1,2,2, 1,1)
sequence <- c(1,2,3, 1,2,3, 1,2,1,2, 1,2)
score <- c(0,0,0, 0,0,1, 0,1,0,0, 1,0)
data <- data.frame("id"=id, "item.id"=item.id, "sequence"=sequence, "score"=score)
data
id item.id sequence score
1 1 1 1 0
2 1 1 2 0
3 1 1 3 0
4 2 1 1 0
5 2 1 2 0
6 2 1 3 1
7 3 1 1 0
8 3 1 2 1
9 3 2 1 0
10 3 2 2 0
11 4 1 1 1
12 4 1 2 0
id represents persons and item.id represents questions. sequence is the attempt number (each attempt may change the response), and score is the score of the item on that attempt.
What I am trying to do is subset the person-item pairs whose score changed from 0 to 1, and separately those whose score changed from 1 to 0.
The desired outputs would be:
data.0.to.1
id item.id sequence score
2 1 1 0
2 1 2 0
2 1 3 1
3 1 1 0
3 1 2 1
data.1.to.0
id item.id sequence score
4 1 1 1
4 1 2 0
Any thoughts? Thanks!
Here is one option by taking the difference of 'score' grouped by 'id', 'item.id'
library(dplyr)
data %>%
group_by(id, item.id) %>%
filter(any(score != 0)) %>%
mutate(ind = c(0, diff(score))) %>%
group_by(ind = ind[ind!=0][1]) %>%
group_split(ind, keep = FALSE)
#[[1]]
# A tibble: 2 x 4
# id item.id sequence score
# <dbl> <dbl> <dbl> <dbl>
#1 4 1 1 1
#2 4 1 2 0
#[[2]]
# A tibble: 5 x 4
# id item.id sequence score
# <dbl> <dbl> <dbl> <dbl>
#1 2 1 1 0
#2 2 1 2 0
#3 2 1 3 1
#4 3 1 1 0
#5 3 1 2 1
I'd do this:
library(dplyr)
data.0.to.1 = data %>%
group_by(id, item.id) %>%
filter(any(diff(score) > 0))
data.1.to.0 = data %>%
group_by(id, item.id) %>%
filter(any(diff(score) < 0))
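The diff()-based idea can also be sketched in base R. This is only an illustration, assuming the rows are already ordered by sequence within each id and item.id pair; key, went_up, and went_down are hypothetical helper names.

```r
# Data from the question
data <- data.frame(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  item.id  = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1),
  sequence = c(1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2),
  score    = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0)
)

# One grouping key per person-item pair
key <- interaction(data$id, data$item.id, drop = TRUE)

# diff() > 0 means the score rose between attempts, < 0 means it fell
went_up   <- as.logical(ave(data$score, key, FUN = function(s) any(diff(s) > 0)))
went_down <- as.logical(ave(data$score, key, FUN = function(s) any(diff(s) < 0)))

data.0.to.1 <- data[went_up, ]
data.1.to.0 <- data[went_down, ]
```

Note that a pair whose score goes 0, 1, 0 would satisfy both conditions and so appear in both subsets; the example data contains no such pair.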

Building sum of dynamic number of rows in dplyr

My df looks something like the first three columns of the following:
ID VAL LENGTH SUM
1 1 1 1
1 1 1 1
1 1 2 2
1 1 2 2
2 0 1 0
2 3 1 0
2 4 2 3
I want to add a fourth column, SUM, defined as the sum of the first LENGTH values of VAL within the row's group (ID).
How do I do that?
You could do:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(SUM = sapply(LENGTH, function(x) sum(VAL[seq_len(x)]))) # seq_len() also handles LENGTH == 0
Output:
# A tibble: 7 x 4
# Groups: ID [2]
ID VAL LENGTH SUM
<int> <int> <int> <dbl>
1 1 1 1 1
2 1 1 1 1
3 1 1 2 2
4 1 1 2 2
5 2 0 1 0
6 2 3 1 0
7 2 4 2 3
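Since the sum of the first LENGTH values is just the running total read off at position LENGTH, a cumulative sum gives the same result without a nested loop. The sketch below is in base R and assumes every LENGTH is at least 1 and no larger than the size of its group; the data frame is reconstructed from the question.

```r
# First three columns from the question (an assumption)
df <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2),
  VAL    = c(1, 1, 1, 1, 0, 3, 4),
  LENGTH = c(1, 1, 2, 2, 1, 1, 2)
)

# Within each ID, take the running total of VAL and pick the
# entry at position LENGTH for every row of that group
df$SUM <- ave(seq_len(nrow(df)), df$ID, FUN = function(i) {
  cumsum(df$VAL[i])[df$LENGTH[i]]
})

df$SUM
```

In a dplyr pipeline the same idea would be cumsum(VAL)[LENGTH] inside a grouped mutate().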

Create a combination ID number from a set of factors in R

Can anyone help me compute a new variable that numbers each distinct combination of some factors?
Assume there are 4 within-subject factors (A, B, C, D) with 8 repetitions of each combination for each of 10 subjects. This is how my data could look, to represent its actual structure:
library(AlgDesign) # for generating a factorial design
df <- gen.factorial(c(2,2,2,2,8,10), factors = "all",
varNames = c("A", "B", "C", "D", "replication", "Subject"))
> head(df)
A B C D replication Subject
1 1 1 1 1 1 1
2 2 1 1 1 1 1
3 1 2 1 1 1 1
4 2 2 1 1 1 1
5 1 1 2 1 1 1
6 2 1 2 1 1 1
> tail(df)
A B C D replication Subject
1275 1 2 1 2 8 10
1276 2 2 1 2 8 10
1277 1 1 2 2 8 10
1278 2 1 2 2 8 10
1279 1 2 2 2 8 10
1280 2 2 2 2 8 10
In this example replication was simply generated in order to force 8 reps, but it doesn't "code" the combination itself.
My original data has only the variables A, B, C, D and Subject, and I'd like to compute replication so that it has a distinct value for each combination of A, B, C, D.
library(AlgDesign)
library(dplyr)
df <- gen.factorial(c(2,2,2,2,8,10), factors = "all",
varNames = c("A", "B", "C", "D", "replication", "Subject"))
df %>%
rowwise() %>% # for each row
mutate(factors = paste0(c(A,B,C,D), collapse = "_")) %>% # create a combination of your factors
ungroup() %>% # forget the row grouping
mutate(replication_upd = as.numeric(factor(factors))) # create a number based on the combination you have
# # A tibble: 1,280 x 8
# A B C D replication Subject factors replication_upd
# <fct> <fct> <fct> <fct> <fct> <fct> <chr> <dbl>
# 1 1 1 1 1 1 1 1_1_1_1 1
# 2 2 1 1 1 1 1 2_1_1_1 9
# 3 1 2 1 1 1 1 1_2_1_1 5
# 4 2 2 1 1 1 1 2_2_1_1 13
# 5 1 1 2 1 1 1 1_1_2_1 3
# 6 2 1 2 1 1 1 2_1_2_1 11
# 7 1 2 2 1 1 1 1_2_2_1 7
# 8 2 2 2 1 1 1 2_2_2_1 15
# 9 1 1 1 2 1 1 1_1_1_2 2
#10 2 1 1 2 1 1 2_1_1_2 10
# # ... with 1,270 more rows
You can remove any unnecessary variables. I left them there so you can see how the process works.
Another option is this
# create a look up table based on unique combinations and assign them a number
df %>% distinct(A,B,C,D) %>% mutate(replication_upd = row_number()) -> look_up
# join back to original dataset
df %>% inner_join(look_up, by=c("A","B","C","D")) %>% as_tibble() # as_tibble() replaces the deprecated tbl_df()
# # A tibble: 1,280 x 7
# A B C D replication Subject replication_upd
# <fct> <fct> <fct> <fct> <fct> <fct> <int>
# 1 1 1 1 1 1 1 1
# 2 2 1 1 1 1 1 2
# 3 1 2 1 1 1 1 3
# 4 2 2 1 1 1 1 4
# 5 1 1 2 1 1 1 5
# 6 2 1 2 1 1 1 6
# 7 1 2 2 1 1 1 7
# 8 2 2 2 1 1 1 8
# 9 1 1 1 2 1 1 9
# 10 2 1 1 2 1 1 10
# # ... with 1,270 more rows
Note that the first approach assigns the numbers based on the new variable we create (i.e. it follows the sort order of the pasted A, B, C, D string), while the second approach uses the initial order of your dataset to pick the number for each unique combination.
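The "order of first appearance" numbering can also be sketched in base R with match(). In this illustration expand.grid() stands in for gen.factorial() so that no extra package is needed, the design is shrunk to 3 subjects, and combo_id is a hypothetical column name.

```r
# Stand-in design: 2 x 2 x 2 x 2 combinations for 3 subjects (an assumption)
df <- expand.grid(A = 1:2, B = 1:2, C = 1:2, D = 1:2, Subject = 1:3)

# Build one key per row, then number the keys by first appearance
key <- paste(df$A, df$B, df$C, df$D, sep = "_")
df$combo_id <- match(key, unique(key))

head(df)
```

Swapping unique(key) for sort(unique(key)) would instead number the combinations in the sorted order of the key strings, which corresponds to what the factor-based approach does.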

dplyr how to count cycles in the records

For example, if I have records like:
A B
1 2
2 3
3 1
1 2
2 1
Let's say one cycle runs from 1 (to 2 to 3) back to 1, so I need my data frame to look like
No. A B
cycle1 1 2
cycle1 2 3
cycle1 3 1
cycle2 1 2
cycle2 2 1
Or, even better for me, I just need to record how many times the same record has appeared so far, like
Time A B
Time1 1 2
Time1 2 3
Time1 3 1
Time2 1 2
Time1 2 1
I need to do this because I have to use the summarize function in dplyr for a calculation, but I cannot group the data by A and B directly. The order of the data is also important.
Is this what you want?
library(zoo)
T1 <- which(df$A == 1)
T2 <- 1:length(T1)
T2 <- paste('cycle', T2)
df$No <- NA
df$No[T1] <- T2
df$No <- na.locf(df$No)
df
A B No
1 1 2 cycle 1
2 2 3 cycle 1
3 3 1 cycle 1
4 1 2 cycle 2
5 2 1 cycle 2
# Number each occurrence of the same (A, B) record while keeping the original row order
library(dplyr)
df %>% group_by(A, B) %>% mutate(Time = paste('Time', row_number()))
A B Time
<int> <int> <chr>
1 1 2 Time 1
2 2 3 Time 1
3 3 1 Time 1
4 1 2 Time 2
5 2 1 Time 1
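Create an augmented 'diff' variable, c(-1, diff(your_var)). Within a cycle the differences are positive (1 in this example), so a negative difference marks the start of a new cycle, and the leading -1 makes the first row start cycle 1. The cumulative sum of that logical test is then the group number. (My first iteration on the algorithm wasn't quite correct, so I modified it slightly.)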
dat %>% as_tibble() %>% mutate(G = cumsum( c(-1, diff(A)) < 0 ) )
# A tibble: 5 x 3
A B G
<int> <int> <int>
1 1 2 1
2 2 3 1
3 3 1 1
4 1 2 2
5 2 1 2
dat %>% as_tibble() %>% mutate(G = paste0( "time", cumsum( c(-1, diff(A)) < 0 ) ))
# A tibble: 5 x 3
A B G
<int> <int> <chr>
1 1 2 time1
2 2 3 time1
3 3 1 time1
4 1 2 time2
5 2 1 time2
One could also test for A == 1, but then sequences like 1,2,3,2,3,4 would not be split properly.
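A short base R sketch of both labelings, plus the comparison with the A == 1 test, is given below. The vectors are taken from the question, A2 is the sequence mentioned above, and the variable names are illustrative.

```r
A <- c(1, 2, 3, 1, 2)
B <- c(2, 3, 1, 2, 1)

# New cycle whenever A drops; the leading -1 starts cycle 1 on row 1
cycle <- cumsum(c(-1, diff(A)) < 0)
cycle
# 1 1 1 2 2

# Occurrence counter per (A, B) record, in row order
time <- ave(seq_along(A), A, B, FUN = seq_along)
time
# 1 1 1 2 1

# A sequence that restarts at 2 instead of 1
A2 <- c(1, 2, 3, 2, 3, 4)
by_drop <- cumsum(c(-1, diff(A2)) < 0)  # 1 1 1 2 2 2
by_one  <- cumsum(A2 == 1)              # 1 1 1 1 1 1, misses the restart
```

The drop-based test treats any decrease in A as the start of a new cycle, which is an assumption about the data that should be checked before relying on it.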

Resources