Summarizing/counting multiple binary variables in R

For the purpose of this question, my data set includes 16 columns (c1_d, c2_d, ..., c16_d) and 364 rows (1-364). This is what it briefly looks like:
c1_d c2_d c3_d c4_d c5_d c6_d c7_d c8_d c9_d c10_d c11_d c12_d c13_d c14_d c15_d c16_d
1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0
2 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0
4 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0
5 1 0 1 1 1 1 0 1 0 1 1 0 0 0 1 0
Please note that, for example, row 1 has five 1s and eleven 0s.
This is what I'm trying to do: count how many rows contain a given number of 1s (i.e., by the end of this analysis I want something like: 20 rows had zero 1s, 33 rows had one 1, 100 rows had ten 1s, etc.).
I created a data frame containing all the rows (364) and columns (16) I need. I tried print.data.frame, whose result is shown above, but it doesn't give me the number of 0s and 1s per row. I also tried functions such as table, ftable, and xtabs, but they don't really work for more than three variables.
I would highly appreciate your help on this.

If I understand correctly:
library(dplyr)
library(tidyr)
df %>%
  transmute(count0 = rowSums(df == 0),
            count1 = rowSums(df == 1)) %>%
  pivot_longer(everything()) %>%
  count(name, value)
name value n
<chr> <dbl> <int>
1 count0 5 1
2 count0 6 1
3 count0 7 1
4 count0 11 1
5 count0 12 1
6 count1 4 1
7 count1 5 1
8 count1 9 1
9 count1 10 1
10 count1 11 1
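If only the distribution of 1s per row is needed, a base-R sketch gets there in two calls (shown here on the five sample rows from the question): rowSums counts the 1s in each row, and table tallies how many rows share each count.

```r
# Five sample rows from the question (row numbers in the first column)
df <- read.table(text = "
1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0
2 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0
4 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0
5 1 0 1 1 1 1 0 1 0 1 1 0 0 0 1 0", row.names = 1)
names(df) <- paste0("c", 1:16, "_d")

# Number of 1s per row, then how many rows share each count
ones_per_row <- rowSums(df == 1)
table(ones_per_row)
```

On the full 364-row data this yields exactly the desired "N rows had k 1s" summary.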

Related

identify cases when a value of a column appeared in that same column while the value of another column changed

I have a data frame with two columns: one contains card IDs and the other the number of the week (an integer). Each use of a card generates a row with that card ID and the week in which it was used, so a given week number has multiple rows with different card IDs, and the same card ID can be repeated across several weeks. What I want to do is determine how many NEW cards are used each week compared to the previous week, the previous two weeks, and so on.
I mean, I want to add a column named, for example, "used_last_week" that takes the value 1 if that card ID was used in the previous week and 0 if not. Then add another column that takes the value 1 if that card ID was used two weeks before and 0 if not, and so on up to 4 weeks prior.
How can I do this? I thought of for and while loops, but I couldn't pull it off.
Thank you very much.
PS: the card ID variable and the week number variable are both numeric.
What I have now is this:
card_id num_week
123234 1
124531 1
345124 1
451433 1
512453 2
123234 2
124531 2
235467 2
145246 3
134353 3
512453 3
123234 3
And I want the result to be something like this:
card_id num_week used_week_prior used_2_weeks_priors
123234 1 0 0
124531 1 0 0
345124 1 0 0
451433 1 0 0
512453 2 0 0
123234 2 1 0
124531 2 1 0
235467 2 0 0
145246 3 0 0
134353 3 0 0
512453 3 1 0
123234 3 1 1
That's the idea, but with columns all the way up to "used_4_weeks_prior".
For mutating 4 (or more) columns at once, you can adopt the following approach.
df <- read.table(text = 'card_id num_week
123234 1
124531 1
345124 1
451433 1
512453 2
123234 2
124531 2
235467 2
145246 3
134353 3
512453 3
123234 3', header = T)
library(tidyverse)
df %>%
  bind_cols(map_dfc(1:4, ~ df %>%
    group_by(card_id) %>%
    transmute(!!paste0('weeks_prior', .x) := +((row_number() - .x) > 0)) %>%
    ungroup() %>%
    select(-card_id)))
#> card_id num_week weeks_prior1 weeks_prior2 weeks_prior3 weeks_prior4
#> 1 123234 1 0 0 0 0
#> 2 124531 1 0 0 0 0
#> 3 345124 1 0 0 0 0
#> 4 451433 1 0 0 0 0
#> 5 512453 2 0 0 0 0
#> 6 123234 2 1 0 0 0
#> 7 124531 2 1 0 0 0
#> 8 235467 2 0 0 0 0
#> 9 145246 3 0 0 0 0
#> 10 134353 3 0 0 0 0
#> 11 512453 3 1 0 0 0
#> 12 123234 3 1 1 0 0
Created on 2021-05-20 by the reprex package (v2.0.0)
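Under the same assumptions the answer above relies on (rows ordered by week, each card used at most once per week), weeks_priork is simply an indicator that the card has already appeared more than k times, which can also be sketched in base R with ave():

```r
df <- data.frame(
  card_id  = c(123234, 124531, 345124, 451433, 512453, 123234,
               124531, 235467, 145246, 134353, 512453, 123234),
  num_week = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
)

# Running occurrence number of each card_id (1st, 2nd, 3rd use, ...)
occ <- ave(seq_along(df$card_id), df$card_id, FUN = seq_along)

# weeks_priork is 1 when this is at least the (k+1)-th use of the card
for (k in 1:4) {
  df[[paste0("weeks_prior", k)]] <- as.integer(occ > k)
}
df
```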

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked. Each shows up as either 1 or 0:
jb$finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to add up the 1s in each row across the selected columns/variables I just made (using the code above), into another column called "sum.content". I used the code below:
jb <- jb %>%
mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result was 14 for every row. I was expecting something at most 9, since I only selected 9 variables. I don't want to delete the other variables like V1 and V2; I just want to sum over some of the variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to add up the 1s in each row, within the columns I select. How would I go about doing that in code, for a given set of variables/columns?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
library(dplyr)

data <- matrix(rnorm(100, 100, 30), nrow = 10)
# recode to binary
data <- apply(data, 2, function(x) ifelse(x > 100, 1, 0))
# change some of the column names to illustrate the impact of
# select() within mutate()
colnames(data) <- c(paste0("V", 1:5), paste0("X", 1:5))
as.data.frame(data) %>%
  mutate(total = select(., starts_with("V")) %>% rowSums())
...and the output, where the totals should equal the sum of V1-V5 but not X1-X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
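With dplyr 1.0 or later, the select(.) trick can be replaced by across(), which restricts the row sum to the named columns; a minimal sketch using a few of the question's column names (the values here are made up for illustration):

```r
library(dplyr)

jb <- data.frame(V1            = c(1, 2, 2),
                 V2            = c(1, 0, 0),
                 Finances      = c(1, 1, 0),
                 Exercise      = c(1, 0, 0),
                 Volunteer     = c(1, 0, 0),
                 Relationships = c(1, 0, 0),
                 Laugh         = c(0, 1, 1))

# rowSums(across(...)) sums only the listed columns, row by row,
# leaving V1 and V2 out of the total
jb <- jb %>%
  mutate(sum.content = rowSums(across(c(Finances, Exercise, Volunteer,
                                        Relationships, Laugh))))
jb$sum.content
```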

Selecting specific columns from dataset

I have a dataset which looks like this:
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
3 1 1 0 0 1 1 0
6 3 0 1 0 1 0 1
2 3 1 0 0 1 1 0
10 5 0 1 1 0 1 0
0 0 1 0 1 0 0 1
I want to have new data frame (df) which only contains columns which ends with 1.1, 2.1 i.e.
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
0 1 0
1 1 1
0 1 0
1 0 0
0 0 1
I'm only showing a few columns here, but the data set actually contains more than 100 columns, so the solution should work for any number of columns.
Thanks in advance.
I assume the pattern is that the column name ends in ".1"; you may need to adapt it at that point.
The data I am using:
original_data
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
1 3 1 1 0 0 1 1 0
This one actually matches everything ending in "1", since the unescaped . matches any character:
df <- original_data[which(grepl(".1$", names(original_data)))]
To match only names ending in ".1", you have to escape the dot:
df <- original_data[which(grepl("\\.1$", names(original_data)))]
For original_data both gave me the same result:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
1 0 1 0
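If dplyr is available, tidyselect's ends_with() is another option; it matches a literal suffix rather than a regular expression, so no escaping of the dot is needed. A sketch with the question's column names:

```r
library(dplyr)

original_data <- data.frame(A = 3, B = 1,
                            X50_TT_1.0 = 1, X50_TT_1.1 = 0,
                            X60_DD_2.0 = 0, X60_DD_2.1 = 1,
                            X100_L2V_7.0 = 1, X100_L2V_7.1 = 0)

# ends_with() takes a fixed string, not a regex
df <- select(original_data, ends_with(".1"))
names(df)
```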

R how to generate a descending sequence by subject measuring the distance from the next uninterrupted series of a given value

I have spent a lot of time trying to figure out how to create a descending sequence, by subject, that measures the distance to the next uninterrupted series of a given value in another column. Do you have any suggestions?
Here is an example of the problem:
Given the following data, where the "id" column is the subject unique identifier and the column "dummy" is an attribute
mydata<-data.frame(id=rep(seq(1,3),each=5), dummy=c(0,0,0,1,1,0,0,1,0,1,0,0,0,0,0))
id dummy
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 2 0
7 2 0
8 2 1
9 2 0
10 2 1
11 3 0
12 3 0
13 3 0
14 3 0
15 3 0
Generate a new column measuring the distance to the next uninterrupted series of the value 1 in the "dummy" column (note: I am counting an individual occurrence of the value 1 as an uninterrupted series). Here is an example of the output:
id dummy output
1 1 0 3
2 1 0 2
3 1 0 1
4 1 1 0
5 1 1 0
6 2 0 2
7 2 0 1
8 2 1 0
9 2 0 1
10 2 1 0
11 3 0 0
12 3 0 0
13 3 0 0
14 3 0 0
15 3 0 0
Thanks,
H
Here's an attempt using the data.table package, in two steps.
The first step is to shift the dummy column one step forward, so that we can afterwards check whether a sequence of zeros is followed by a one.
The second step is to compute the sequences, on the condition that they are zero sequences followed by a one.
I'm using the shift function from the latest data.table version (v1.9.6+) for this task, but you can just use indx := c(dummy[-1L], 0L) instead.
library(data.table) # V1.9.6+
setDT(mydata)[, indx := shift(dummy, type = "lead", fill = 0L)]
mydata[, output := .N:1L*(dummy == 0L)*(indx[.N] == 1L), by = .(id, cumsum(dummy == 1L))]
# id dummy indx output
# 1: 1 0 0 3
# 2: 1 0 0 2
# 3: 1 0 1 1
# 4: 1 1 1 0
# 5: 1 1 0 0
# 6: 2 0 0 2
# 7: 2 0 1 1
# 8: 2 1 0 0
# 9: 2 0 1 1
# 10: 2 1 0 0
# 11: 3 0 0 0
# 12: 3 0 0 0
# 13: 3 0 0 0
# 14: 3 0 0 0
# 15: 3 0 0 0
Here is an option with base R. First we label the number of consecutive identical entries (with rle) in the dummy column in reverse order:
mydata$output<- unlist(sapply(rle(mydata$dummy)$lengths,function(x) rev(seq(x))))
Then we set the values of the output column to zero for all rows in which dummy is not equal to zero:
mydata$output[mydata$dummy!=0] <- 0
In a last step, we identify the ids whose dummy values are all zero and set their entries in the output column to zero, too (using %in% so this also works if several ids qualify):
zero_ids <- with(aggregate(dummy ~ id, mydata, sum), id[dummy == 0])
mydata$output[mydata$id %in% zero_ids] <- 0
#> mydata
# id dummy output
#1 1 0 3
#2 1 0 2
#3 1 0 1
#4 1 1 0
#5 1 1 0
#6 2 0 2
#7 2 0 1
#8 2 1 0
#9 2 0 1
#10 2 1 0
#11 3 0 0
#12 3 0 0
#13 3 0 0
#14 3 0 0
#15 3 0 0
This solution assumes that there are no negative values in the dummy column.

How to exclude cases that do not repeat X times in R?

I have unbalanced longitudinal data in long format. I would like to exclude all cases that do not contain complete information, by which I mean all cases that are not repeated 8 times. Can someone help me find a solution?
Below is an example: I have three subjects {A, B, and C}. I have 8 observations for A and B, but only 2 for C. How can I delete the rows in which C is present, based on it having fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 part matches the values in the V1 column that occur exactly 8 times. The names(which(...)) part creates a plain character vector that we can match against using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
Filter(function(subgroup) nrow(subgroup) == 8,
split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
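For completeness, the usual dplyr idiom for the same filter is group_by() plus filter(n() == 8); a sketch assuming the first column is named V1, as in the read.table answer:

```r
library(dplyr)

# same subjects as the question: 8 rows for A and B, 2 for C
temp <- data.frame(V1 = rep(c("A", "B", "C"), times = c(8, 8, 2)))

# keep only the subjects that contribute exactly 8 rows
complete <- temp %>%
  group_by(V1) %>%
  filter(n() == 8) %>%
  ungroup()
```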
