Convert one factor column to multiple dichotomous columns in R

I have a dataset with patient IDs and their diagnoses, as follows:
Id Diagnoses
1 Nerve conditions (e.g., Multiple sclerosis, myasthenia gravis, Guillain-Barre syndrome, demyelinating polyneuropathy)
2 Gastrointestinal conditions (e.g., irritable bowl disease, ulcerative colitis, Chron's disease),Heart conditions,High blood pressure,Migraines/headaches
3 Heart conditions,Traumatic brain injury
4 Chronic pain,Heart conditions,Post-traumatic Stress Disorder (PTSD),Traumatic brain injury
5 Anxiety,Chronic pain,Depression,Sleep apnea
6 High blood pressure
7 High blood pressure
How can I split the Diagnoses column into one 0/1 column per diagnosis, as follows:
Id Anxiety Depression Nerve conditions Sleep apnea Chronic Diseases AND SO ON....
1 0 0 0 1 1
2 1 1 1 1 1
3 1 1 1 1 0
4 0 0 1 1 1
5 1 0 0 0 1
6 1 1 1 0 1
7 1 1 0 1 0
I have tried this code, but I did not get the result:
df %>%
separate_rows(Diagnoses, sep=",") %>%
separate(Q2.3, into = c("Anxiety", "Depression", "THE REST OF CONDITIONS"), sep=":\\s*") %>%
mutate(anxiety1 = str_c("Anxiety", Anxiety))
I appreciate your help.

Does this work:
library(stringr)
library(dplyr)
library(tidyr)
df %>% mutate(Diagnoses = str_remove_all(Diagnoses, ' \\([^)]*\\)')) %>% # drop each parenthesised note; [^)]* stops at the first ")" so later diagnoses are kept
separate_rows(Diagnoses, sep = ',') %>% count(Id, Diagnoses, name = 'Cnt') %>%
pivot_wider(id_cols = Id, names_from = Diagnoses, values_from = Cnt, values_fill = list(Cnt = 0))
# A tibble: 7 x 12
Id `Nerve conditions` `Gastrointestinal conditions` `Heart conditions` `High blood pressure` `Migraines/headaches` `Traumatic brain injury` `Chronic pain` `Post-traumatic Stress Disorder` Anxiety Depression `Sleep apnea`
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 0 0 0 0 0 0 0 0 0 0
2 2 0 1 1 1 1 0 0 0 0 0 0
3 3 0 0 1 0 0 1 0 0 0 0 0
4 4 0 0 1 0 0 1 1 1 0 0 0
5 5 0 0 0 0 0 0 1 0 1 1 1
6 6 0 0 0 1 0 0 0 0 0 0 0
7 7 0 0 0 1 0 0 0 0 0 0 0
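One thing to watch with this data is that the parenthesised examples contain commas themselves, so they have to be stripped before splitting on ','. Here is a minimal base-R sketch of the same idea, assuming a shortened copy of the data (the four rows below are abbreviated for illustration, and the names clean, parts, long and dummies are made up here):

```r
# Shortened copy of the question's data (abbreviated for illustration)
df <- data.frame(
  Id = 1:4,
  Diagnoses = c(
    "Nerve conditions (e.g., Multiple sclerosis, myasthenia gravis)",
    "Gastrointestinal conditions (e.g., ulcerative colitis),Heart conditions,High blood pressure",
    "Heart conditions,Traumatic brain injury",
    "Chronic pain,Post-traumatic Stress Disorder (PTSD),Traumatic brain injury"
  ),
  stringsAsFactors = FALSE
)

# Remove every " (...)" group; [^)]* cannot run past the first closing parenthesis
clean <- gsub(" \\([^)]*\\)", "", df$Diagnoses)

# Split on commas and build a long Id/Diagnosis table
parts <- strsplit(clean, ",", fixed = TRUE)
long <- data.frame(
  Id = rep(df$Id, lengths(parts)),
  Diagnosis = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Cross-tabulate and turn counts into 0/1 indicators
dummies <- +(unclass(table(long$Id, long$Diagnosis)) > 0)
wide <- data.frame(Id = as.integer(rownames(dummies)), dummies, check.names = FALSE)
```

The greedy pattern ' \\(.*\\)?' would instead delete everything from the first " (" to the end of the string, which silently drops any diagnoses listed after a parenthesised one.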

Related

How to convert a dataset where some subjects chose multiple answers into dummy-variable format?

I have this example dataset
df <- data.frame(subjects = 1:12,
Why_are_you_not_happy =
c(1,2,"1,2,5",5,1,2,"3,4",3,2,"1,5",3,4),
why_are_you_sad =
c("1,2,3",1,2,3,"4,5,3",2,1,4,3,1,1,1) )
I would like to convert it into dummy-variable format (based on the 5 possible answers to each question). Can someone guide me through an effective way? Thanks.
You can use separate_rows to split the multiple choices, convert them to dummies, and summarise by subjects (to get one row per subject, with all their choices).
library(fastDummies)
library(tidyr)
library(dplyr)
df %>%
separate_rows(Why_are_you_not_happy, why_are_you_sad) %>%
dummy_cols(c("Why_are_you_not_happy", "why_are_you_sad"),
remove_selected_columns = TRUE) %>%
group_by(subjects) %>%
summarise(across(everything(), max))
output
# A tibble: 12 × 11
subjects Why_are_you…¹ Why_a…² Why_a…³ Why_a…⁴ Why_a…⁵ why_a…⁶ why_a…⁷ why_a…⁸ why_a…⁹ why_a…˟
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 0 0 0 0 1 1 1 0 0
2 2 0 1 0 0 0 1 0 0 0 0
3 3 1 1 0 0 1 0 1 0 0 0
4 4 0 0 0 0 1 0 0 1 0 0
5 5 1 0 0 0 0 0 0 1 1 1
6 6 0 1 0 0 0 0 1 0 0 0
7 7 0 0 1 1 0 1 0 0 0 0
8 8 0 0 1 0 0 0 0 0 1 0
9 9 0 1 0 0 0 0 0 1 0 0
10 10 1 0 0 0 1 1 0 0 0 0
11 11 0 0 1 0 0 1 0 0 0 0
12 12 0 0 0 1 0 1 0 0 0 0

Selecting different variables for each participant to calculate a unique score

I have a dataframe where I would like to check whether people identified the right themes in a memory test. Each participant saw different stimuli, so this is slightly more complicated than I expected. The first participant, for instance, saw the suicide, memory, and time themes, so a 1 in those columns is good. A 1 in a column for a theme they didn't see is bad. For instance, participant 1 below correctly identified all of their images: they were shown suicide, memory, and time, and have a 1 in those columns and a 0 in the others. However, the next participant said they saw the memory theme but didn't. I would like to create four additional columns that show 1 if they got the theme right (saw the theme and marked 1, or didn't see the theme and marked 0) and 0 if they got it wrong (saw the theme and marked 0, or didn't see the theme and marked 1).
I'm a little at a loss on how to do this and appreciate the help!!!
list <- c("suicide memory time","suicide vomit time","vomit alcohol time"," ",
" ","alcohol suicide children")
id <- c(1:6)
suicide1<- c(1,1,0,0,0,1)
suicide2<- c(1,1,1,0,0,1)
memory1 <- c(1,0,0,1,0,0)
memory2 <- c(1,0,0,0,0,0)
alcohol<- c(0,1,1,1,1,1)
time<- c(1,0,1,1,1,0)
foil1<- c(0,0,0,0,0,0)
foil2 <- c(0,0,1,0,0,0)
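df<- data.frame(list,id,suicide = suicide1, memory = memory2, alcohol, time, foil1, foil2) # suicide and memory were undefined; suicide1 and memory2 are assumed here because they match the answers' output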
How do I create 8 new columns:
suicide1_score
memory2_score... etc that show 0/1 for each participant based on what they actually saw?
nms <- names(df)[3:8]
out <- t(sapply(strsplit(df$list, " "), match, x = nms, nomatch = 0L))
colnames(out) <- paste0(nms, "_score")
cbind(df, data.frame(+(out > 0)))
# list id suicide memory alcohol time foil1 foil2 suicide_score memory_score alcohol_score time_score foil1_score foil2_score
# 1 suicide memory time 1 1 1 0 1 0 0 1 1 0 1 0 0
# 2 suicide vomit time 2 1 0 1 0 0 0 1 0 0 1 0 0
# 3 vomit alcohol time 3 0 0 1 1 0 1 0 0 1 1 0 0
# 4 4 0 0 1 1 0 0 0 0 0 0 0 0
# 5 5 0 0 1 1 0 0 0 0 0 0 0 0
# 6 alcohol suicide children 6 1 0 1 0 0 0 1 0 1 0 0 0
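The _score columns above record whether each theme appears in a participant's list. To get the correctness score described in the question (1 when "saw it" and "marked it" agree), the two can be compared directly. This is a minimal sketch, assuming suicide and memory are single columns with the values shown in the output above; the names themes, seen, marked and correct are made up for illustration:

```r
# Data as in the output above (suicide/memory values are an assumption)
df <- data.frame(
  list = c("suicide memory time", "suicide vomit time", "vomit alcohol time",
           " ", " ", "alcohol suicide children"),
  id = 1:6,
  suicide = c(1, 1, 0, 0, 0, 1),
  memory  = c(1, 0, 0, 0, 0, 0),
  alcohol = c(0, 1, 1, 1, 1, 1),
  time    = c(1, 0, 1, 1, 1, 0),
  foil1   = c(0, 0, 0, 0, 0, 0),
  foil2   = c(0, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

themes <- names(df)[3:8]
shown <- strsplit(df$list, " ", fixed = TRUE)

# seen[i, j] is TRUE when participant i was shown theme j
seen <- t(vapply(shown, function(s) themes %in% s, logical(length(themes))))

# marked[i, j] is TRUE when participant i reported theme j
marked <- as.matrix(df[themes]) == 1

# Correct when the report agrees with what was actually shown
correct <- +(seen == marked)
colnames(correct) <- paste0(themes, "_correct")
result <- cbind(df, correct)
```

Under these assumptions, participant 1 gets a 1 in every _correct column, while participant 2 gets a 0 for alcohol (reported but not shown) and for time (shown but not reported).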
Here is a very verbose approach using tidyverse and nnet libraries:
library(nnet)
library(tidyverse)
df %>%
select(list, id) %>%
separate_rows(list) %>%
mutate(list = as.factor(list)) %>%
cbind((class.ind(.$list) == 1)*1) %>% # nnet library
group_by(id) %>%
mutate(list = toString(list)) %>%
summarise(across(-c(list, V1), sum)) %>%
rename_with(., ~paste(., "score", sep = "_")) %>%
rename(id = id_score) %>%
right_join(df, by= "id") %>%
relocate(list:foil2, everything())
# A tibble: 6 x 14
list suicide memory alcohol time foil1 foil2 id alcohol_score children_score memory_score suicide_score time_score vomit_score
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "suicide memory time" 1 1 0 1 0 0 1 0 0 1 1 1 0
2 "suicide vomit time" 1 0 1 0 0 0 2 0 0 0 1 1 1
3 "vomit alcohol time" 0 0 1 1 0 1 3 1 0 0 0 1 1
4 " " 0 0 1 1 0 0 4 0 0 0 0 0 0
5 " " 0 0 1 1 0 0 5 0 0 0 0 0 0
6 "alcohol suicide children" 1 0 1 0 0 0 6 1 1 0 1 0 0

Assign group identifiers to groups of rows falling between rows containing a string in R

I've been given an Excel file in which the end of each group of data is marked by a row that is blank except for one cell which contains a string like "Person 1", "Person 2", "Person 3", and so on. The data belonging to Person 1 are in rows preceding the row containing "Person 1", the data belonging to Person 2 are in the rows between the row with "Person 1" and the row containing "Person 2". This pattern is followed until the end of the file, where the last row contains a cell with "Person 100". To make things even more interesting, the "Person [n]" string is not always in the same column and the number of rows per person can vary. See the toy example below.
df_1 <- data.frame(iv1=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
iv2=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
iv3=c(rbinom(3,1,.4), "Person 1", rbinom(4,1,.4), NA, rbinom(2,1,.4), "Person 3"),
dv1=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), "Person 2", rbinom(2,1,.4), NA),
dv2=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA),
dv3=c(rbinom(3,1,.4), NA, rbinom(4,1,.4), NA, rbinom(2,1,.4), NA))
Yields this data frame
iv1 iv2 iv3 dv1 dv2 dv3
1 1 1 0 1 1 0
2 0 0 1 0 0 0
3 1 0 0 1 0 1
4 NA NA Person 1 <NA> NA NA
5 1 1 0 0 1 1
6 1 0 0 0 0 0
7 0 0 0 1 0 0
8 1 0 0 1 1 1
9 NA NA <NA> Person 2 NA NA
10 0 0 0 1 0 0
11 0 1 0 0 0 1
12 NA NA Person 3 <NA> NA NA
What I would like to do is create a new column ("Person_ID") that identifies the data belonging to each person, so Person_ID would equal 1 for rows belonging to Person 1, Person_ID would equal 2 for rows belonging to Person 2, and so on, as in the data frame below.
iv1 iv2 iv3 dv1 dv2 dv3 Person_ID
1 1 1 0 1 1 0 1
2 0 0 1 0 0 0 1
3 1 0 0 1 0 1 1
4 1 1 0 0 1 1 2
5 1 0 0 0 0 0 2
6 0 0 0 1 0 0 2
7 1 0 0 1 1 1 2
8 0 0 0 1 0 0 3
9 0 1 0 0 0 1 3
I would love a dplyr-based solution, but of course, I'm open to whatever works. Thanks!
We could do it this way.
The values in iv1:dv3 below do not match yours because you did not set a seed.
The first solution depends on the NAs in the separator rows, so other NAs in the data may interfere with it.
The second solution is independent of NAs:
library(dplyr)
df_1 %>%
mutate(Person_ID=cumsum(is.na(iv1))+1) %>%
na.omit()
iv1 iv2 iv3 dv1 dv2 dv3 Person_ID
<int> <int> <chr> <chr> <int> <int> <dbl>
1 0 0 0 0 0 0 1
2 1 1 1 0 1 0 1
3 1 0 0 0 0 0 1
4 1 1 0 0 0 1 2
5 1 1 0 0 0 1 2
6 1 0 0 0 1 1 2
7 0 0 1 1 1 0 2
8 0 0 1 0 0 0 3
9 1 1 0 1 0 0 3
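For a variant that does not rely on NAs at all, one option is to flag separator rows by the "Person n" string itself. This is a minimal base-R sketch, assuming markers always have the form "Person <number>" and appear in increasing order; the seed, the reduced set of columns, and the names is_marker and result are additions for illustration:

```r
set.seed(42)  # seed added so the toy data is reproducible
df_1 <- data.frame(
  iv1 = c(rbinom(3, 1, .4), NA, rbinom(4, 1, .4), NA, rbinom(2, 1, .4), NA),
  iv3 = c(rbinom(3, 1, .4), "Person 1", rbinom(4, 1, .4), NA, rbinom(2, 1, .4), "Person 3"),
  dv1 = c(rbinom(3, 1, .4), NA, rbinom(4, 1, .4), "Person 2", rbinom(2, 1, .4), NA),
  stringsAsFactors = FALSE
)

# A row is a separator if any of its cells looks like "Person <number>"
is_marker <- unname(apply(df_1, 1, function(r) any(grepl("^Person [0-9]+$", r))))

# Each data row belongs to the next separator below it, so its ID is
# one more than the number of separators seen in earlier rows
df_1$Person_ID <- cumsum(c(FALSE, head(is_marker, -1))) + 1L

# Drop the separator rows themselves
result <- df_1[!is_marker, ]
```

Because the separator rows are found by pattern rather than by is.na(), stray NAs in the data columns do not shift the group boundaries.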
Another way could be:
library(tidyverse)
df_1 %>%
mutate(Person_ID = coalesce(iv3, dv1),
Person_ID = ifelse(str_detect(Person_ID, "Person"), parse_number(Person_ID), NA)) %>%
fill(Person_ID, .direction = "up") %>%
na.omit()
Here is another option:
library(tidyverse)
df_1 %>%
unite(Person_ID, everything(), sep = ",", remove = FALSE) %>%
mutate(Person_ID = str_extract(Person_ID, "(?<=Person )[0-9]*")) %>%
fill(Person_ID, .direction = "up") %>%
slice(-which(rowSums(t(apply(df_1, 1, grepl, pattern="Person"))) == 1))
Or another option could be:
df_1 %>%
mutate(across(everything(), ~str_extract(., "(?<=Person )[0-9]*")),
Person_ID = coalesce(iv3, dv1)) %>%
fill(Person_ID, .direction = "up") %>%
select(Person_ID) %>%
bind_cols(., df_1) %>%
na.omit()
Output
Person_ID iv1 iv2 iv3 dv1 dv2 dv3
1 1 0 1 1 1 0 0
2 1 0 0 1 1 0 0
3 1 0 1 1 1 1 0
4 2 1 1 1 0 0 0
5 2 0 0 0 0 1 1
6 2 1 1 0 1 0 0
7 2 1 0 1 0 1 1
8 3 1 1 1 1 0 0
9 3 0 0 1 0 0 1

Binary Variables Combinations Analysis in R

I have a data set, which has a lot of binary variables. For the ease of illustration, here is a smaller version with only 4 variables:
set.seed(5)
my_data<-data.frame("Slept Well"=sample(c(0,1),10,TRUE),
"Had Breakfast"=sample(c(0,1),10,TRUE),
"Worked out"=sample(c(0,1),10,TRUE),
"Meditated"=sample(c(0,1),10,TRUE))
In the above, each row corresponds to an observation. I am interested in analysing the frequency of each unique combination of the variables. For example, how many observations said that they both slept well and meditated, but did not have breakfast or worked out?
I would like to be able to rank the unique combinations from most frequently occurring to the least frequently occurring. What is the best way to go about coding that up?
You can use aggregate.
x <- aggregate(list(n=rep(1, nrow(my_data))), my_data, length)
#x <- aggregate(list(n=my_data[,1]), my_data, length) #Alternative
x[order(-x$n),]
# Slept.Well Had.Breakfast Worked.out Meditated n
#4 0 1 1 0 2
#1 0 0 0 0 1
#2 1 1 0 0 1
#3 0 0 1 0 1
#5 0 0 0 1 1
#6 1 0 0 1 1
#7 0 1 0 1 1
#8 0 0 1 1 1
#9 0 1 1 1 1
What about a dplyr solution:
library(dplyr)
my_data %>%
# group it
group_by_all() %>%
# frequencies
summarise(freq = n()) %>%
# order decreasing
arrange(-freq)
# A tibble: 9 x 5
Slept.Well Had.Breakfast Worked.out Meditated freq
<chr> <chr> <chr> <chr> <int>
1 0 1 1 0 2
2 0 0 0 0 1
3 0 0 0 1 1
4 0 0 1 0 1
5 0 0 1 1 1
6 0 1 0 1 1
7 0 1 1 1 1
8 1 0 0 1 1
9 1 1 0 0 1
Or with data.table:
library(data.table)
res <- setorder(data.table(my_data)[, .(freq = .N), by = names(my_data)], -freq)
res
Slept.Well Had.Breakfast Worked.out Meditated freq
1: 0 1 1 0 2
2: 1 0 0 1 1
3: 0 0 1 0 1
4: 0 0 0 0 1
5: 0 1 0 1 1
6: 0 1 1 1 1
7: 0 0 1 1 1
8: 0 0 0 1 1
9: 1 1 0 0 1
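To answer the specific example in the question (slept well and meditated, but no breakfast and no workout), the rows can also be collapsed into a pattern string and counted. This is a minimal base-R sketch; the names key, freq and n_target are made up for illustration:

```r
set.seed(5)
my_data <- data.frame("Slept Well"    = sample(c(0, 1), 10, TRUE),
                      "Had Breakfast" = sample(c(0, 1), 10, TRUE),
                      "Worked out"    = sample(c(0, 1), 10, TRUE),
                      "Meditated"     = sample(c(0, 1), 10, TRUE))
# data.frame() converts the names to Slept.Well, Had.Breakfast, Worked.out, Meditated

# One string per row, e.g. "1001" = slept well, no breakfast, no workout, meditated
key <- do.call(paste0, my_data)

# Frequency of every combination, most common first
freq <- sort(table(key), decreasing = TRUE)

# Count for the single combination asked about in the question
n_target <- sum(with(my_data,
  Slept.Well == 1 & Had.Breakfast == 0 & Worked.out == 0 & Meditated == 1))
```

The pattern string keeps the column order of my_data, so sum(key == "1001") gives the same count as n_target.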

R - Common users across months

I have a transaction table with the following columns:
TransactionId UserId YearMonth Group
What I am trying to accomplish is to get unique users across different months.
Eg:
YearMonth Group UsersCountMonth1 UsersCountMonth2 UsersCountMonth3
201301 A 1000 900 800
201301 B 1200 940 700
201302 B 1300 1140 900
201303 A 12e0 970 706
Basically Month1 and Month2 are the incremental months based on YearMonth value for the record.
I am using this result to perform retention analysis.
I remember you were looking for a possibility to analyze subscription cohorts, yesterday. So I guess you can do
library(tidyverse)
set.seed(1)
n <- 100
df <- data.frame(
user = sample(1:20, n, T),
transDate = sample(seq(as.Date("2016-01-01"), as.Date("2016-12-31"), "1 month"), n, T),
group = sample(LETTERS[1:2], n, T)
)
diffmonth <- function(d1, d2) {
# http://stackoverflow.com/questions/1995933/number-of-months-between-two-dates
monnb <- function(d) {
lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
lt$year*12 + lt$mon
}
monnb(d2) - monnb(d1) + 1L
}
df %>%
group_by(user, group) %>%
mutate(cohort = min(transDate), month = diffmonth(cohort, transDate)) %>%
unite(cohort, cohort, group, remove = T) %>%
group_by(month, cohort) %>%
summarise(n=n()) %>%
spread(month, n, fill = 0, drop = F)
# # A tibble: 16 × 12
# cohort `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2016-01-01_A 5 1 0 1 1 1 1 0 2 0 0
# 2 2016-02-01_A 1 0 0 0 0 0 0 0 1 0 1
# 3 2016-02-01_B 4 1 2 1 0 1 2 0 1 1 0
# 4 2016-03-01_A 5 0 3 1 2 2 2 0 1 2 0
# 5 2016-03-01_B 4 0 0 0 2 0 1 0 0 0 0
# 6 2016-04-01_A 4 0 2 1 0 1 0 2 1 0 0
# 7 2016-04-01_B 1 0 0 0 0 0 0 0 0 0 0
# 8 2016-05-01_A 2 0 2 2 0 0 2 0 0 0 0
# 9 2016-05-01_B 1 0 0 1 0 0 2 0 0 0 0
# 10 2016-06-01_A 1 0 2 0 0 1 0 0 0 0 0
# 11 2016-06-01_B 4 0 0 0 0 1 1 0 0 0 0
# 12 2016-07-01_A 1 0 1 0 0 0 0 0 0 0 0
# 13 2016-08-01_B 4 1 1 0 0 0 0 0 0 0 0
# 14 2016-09-01_A 1 0 0 0 0 0 0 0 0 0 0
# 15 2016-10-01_B 1 0 0 0 0 0 0 0 0 0 0
# 16 2016-12-01_A 3 0 0 0 0 0 0 0 0 0 0
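The column headers 1, 2, 3, ... in this table are month offsets from each cohort's first transaction, with the first month counted as 1. This is a minimal sketch of that arithmetic on its own, using hypothetical helper names (month_index, months_since) and made-up dates:

```r
# Month number counted from January 1900
# (POSIXlt stores years since 1900 and months as 0-11)
month_index <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  lt$year * 12L + lt$mon
}

# Offset of `current` relative to `first`, where the first month is 1
months_since <- function(first, current) {
  month_index(current) - month_index(first) + 1L
}

first_purchase <- as.Date("2016-01-15")
later <- as.Date(c("2016-01-20", "2016-03-01", "2017-01-02"))
offsets <- months_since(first_purchase, later)
offsets
# 1 3 13
```

Only the calendar month matters here, so two dates in the same month always get offset 1 regardless of the day.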
