Convert categorical data into binary (1/0) in RStudio

I have a dataset comparing cases to categories of mental illness. In the dataset, mental illness is coded as 0 for no mental illness, 1 for mood disorders, 2 for behavioral disorders, 3 for other, and 4 for disorder-like symptoms. I am trying to convert my dataset (mentallIllness) so that any symptoms or any disorder (i.e. a value from 1 to 4) counts as a 1 (yes, there are signs/a disorder), with 0 still meaning no mental illness.
How can I go about that?
Thanks!

Suppose you have a vector with numbers from 0 to 4:
my_data <- c(0:4, 2, 3, 0)
my_data
#[1] 0 1 2 3 4 2 3 0
Here are a few ways to convert all the non-zeros to 1:
1*(my_data>0)
#[1] 0 1 1 1 1 1 1 0
as.numeric(my_data>0)
#[1] 0 1 1 1 1 1 1 0
In both of these cases, the term (my_data > 0) tests each value in my_data to see whether it is greater than 0; if so, the result is TRUE, otherwise FALSE. Multiplying TRUE/FALSE by 1, or converting to numeric, changes those to 1/0.
As Ben Bolker suggested, we could use ifelse to get the same results:
ifelse(my_data == 0, 0, 1)
#[1] 0 1 1 1 1 1 1 0
Your vector might live in a data frame, like:
my_df <- data.frame(my_data = c(0:4, 2, 3, 0))
We could use the same code to make a new variable, or overwrite the existing one:
my_df$recoded = ifelse(my_df$my_data == 0, 0, 1)
my_df
#  my_data recoded
#1       0       0
#2       1       1
#3       2       1
#4       3       1
#5       4       1
#6       2       1
#7       3       1
#8       0       0
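Applied to the OP's setting, a minimal sketch (assuming the codes live in a column hypothetically named status of the mentallIllness data frame):

```r
# Hypothetical stand-in for the OP's data; 'status' holds the codes 0-4.
mentallIllness <- data.frame(status = c(0, 1, 4, 2, 0, 3))

# Any nonzero code (1-4) becomes 1 ("has signs/disorder"); 0 stays 0.
mentallIllness$any_disorder <- as.integer(mentallIllness$status != 0)

mentallIllness$any_disorder
#[1] 0 1 1 1 0 1
```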

Related

r: how to simultaneously change multiple column names based on the individual suffix of each column name

I have received a datasheet p, autogenerated from a registry and containing 1855 columns. The autogeneration automatically appends _vX to each column name, where X corresponds to the number of follow-ups. Unfortunately, this creates ridiculously long column names.
E.g.
p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10 and p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
correspond to the 10th and 20th MRI scans on the same patient, i.e. each column that addresses clinical parameters related to the 10th follow-up ends with _v1_v2_v3_v4_v5_v6_v7_v8_v9_v10.
I seek a solution, preferably in dplyr or a function, that changes the entire _v1_v2_..._vX suffix to fuX, corresponding to the Xth follow-up.
Let's say that p looks like:
a_v2 b_v2_v3 a_v2_v3_v4 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
  a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
1     0     1     1      1      0
2     1     1     0      1      0
Data
p <- structure(list(
  dia_maxrd_v2 = c(0, 1),
  hear_sev_v2_v3 = c(1, 1),
  reop_ind_v2_v3_v4___1 = c(1, 0),
  neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(1, 1),
  symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(0, 0)
), class = "data.frame", row.names = c(NA, -2L))
EDIT
To complicate things, some column names end with "___1", indicating a specific sub-parameter of that clinical parameter, and this ending should be preserved, e.g. _v1_v2_v3_v4___1. This is still to be considered fu4, and the ___1 part should not be omitted.
a_v2 b_v2_v3 a_v2_v3_v4___1 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
  a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
1     0     1         1      1      0
2     1     1         0      1      0
EDIT
My apologies; the solution must also preserve the "basic" column name specifying which parameter the column contains, e.g. post-surgical complications. Only the _v1_v2_v3..._vX part should be substituted with the corresponding fuX. Whatever comes before and after the _v1_v2_v3..._vX part must be preserved.
Consider
dia_maxrd_v2 hear_sev_v2_v3 reop_ind_v2_v3_v4___1 neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
  dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20 symp_pre_lokal_fu20
1             0            1                1              1                   0
2             1            1                0              1                   0
You can use gsub with two capturing groups:
names(p) <- gsub("^(.).*?(\\d+)$", "\\1_fu\\2", names(p))
p
#>   a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
#> 1     0     1     1      1      0
#> 2     1     1     0      1      0
EDIT
With the new requirement stipulated by the OP (endings like ___1, not in the original question), and written as a pipe:
p %>% setNames(gsub("^(.).*?(\\d+_*\\d*)$", "\\1_fu\\2", names(.)))
#>   a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
#> 1     0     1         1      1      0
#> 2     1     1         0      1      0
EDIT
For arbitrary starting strings, it may be easiest to gsub twice:
p %>% setNames(gsub("(\\d{1,2}_v)+", "", names(.))) %>%
  setNames(gsub("_v(\\d+)", "_fu\\1", names(.)))
#>   dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20
#> 1             0            1                1              1
#> 2             1            1                0              1
#>   symp_pre_lokal_fu20
#> 1                   0
#> 2                   0
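A single-pass alternative (my own sketch, not from the answer above) lets a greedy group swallow everything up to the last _vN and captures only its number; this also leaves suffixes like ___1 intact:

```r
# A small subset of the OP's data, enough to exercise all three cases.
p <- structure(list(dia_maxrd_v2 = c(0, 1),
                    hear_sev_v2_v3 = c(1, 1),
                    reop_ind_v2_v3_v4___1 = c(1, 0)),
               class = "data.frame", row.names = c(NA, -2L))

# (_v\d+)* greedily consumes the leading _v1_v2_... run; the final _v(\d+)
# captures only the last follow-up number, which becomes fuN.
names(p) <- gsub("(_v\\d+)*_v(\\d+)", "_fu\\2", names(p))
names(p)
#[1] "dia_maxrd_fu2"    "hear_sev_fu3"     "reop_ind_fu4___1"
```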

Lagging vector adding 1 while resetting to 0 when a condition is met

I have a sequence of treatments, one per day (binary), say:
trt <- c(0, 0, 1, 0, 0, 0, 1, 0, 0)
I want to create a vector, days_since, that:
Is NA up until the first treatment.
Is 0 where trt is 1
Counts the days since the last treatment
So, the output days_since should be:
days_since <- c(NA, NA, 0, 1, 2, 3, 0, 1, 2)
How would I do this in R? To get days_since, I basically need to lag by one element and add 1, but resetting every time the original vector (trt) is 1. If this is doable without a for-loop, that would be ideal, but not absolutely necessary.
Maybe you can try the code below
v <- cumsum(trt)
replace(ave(trt,v,FUN = seq_along)-1,v<1,NA)
which gives
[1] NA NA 0 1 2 3 0 1 2
Explanation
First, we apply cumsum over trt to group the treatments
> v <- cumsum(trt)
> v
[1] 0 0 1 1 1 1 2 2 2
Secondly, using ave helps to add sequential indices within each group
> ave(trt,v,FUN = seq_along)-1
[1] 0 1 0 1 2 3 0 1 2
Finally, since the values before the first treatment should be NA, all values before the first appearance of v == 1 must be replaced by NA. Thus we use replace, with the index logic v < 1.
> replace(ave(trt,v,FUN = seq_along)-1,v<1,NA)
[1] NA NA 0 1 2 3 0 1 2
We can also use
(NA^!cummax(trt)) * sequence(table(cumsum(trt))) - 1
#[1] NA NA 0 1 2 3 0 1 2
Or with rowid from data.table
library(data.table)
(NA^!cummax(trt)) * rowid(cumsum(trt)) - 1
#[1] NA NA 0 1 2 3 0 1 2
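Another base R option (an additional sketch, not from the answers above) expresses the reset-and-increment rule directly with Reduce, seeding the scan with NA so the leading NAs persist until the first treatment:

```r
trt <- c(0, 0, 1, 0, 0, 0, 1, 0, 0)

# Left-to-right scan: a treatment resets the counter to 0, otherwise add 1.
# The NA seed propagates through `+ 1` until the first trt == 1.
days_since <- Reduce(function(prev, x) if (x == 1) 0 else prev + 1,
                     trt, init = NA_real_, accumulate = TRUE)[-1]
days_since
#[1] NA NA  0  1  2  3  0  1  2
```

The [-1] drops the NA seed so the result lines up with trt.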

How to automate recoding of many variables using mutate_at and nested ifelse statement?

There is a large data set consisting of repeated measures of the same variable on each subject. An example data is as below
df <- data.frame(
  "id" = c(1:5),
  "ax1" = c(1, 6, 8, 15, 17),
  "bx1" = c(2, 16, 8, 15, 17))
where "x1" is measured repeatedly so we can have "ax1", "bx1", "cx1" and so on. I am trying to recode these variables. The plan is to recode 1 and any number on the range from 3 to 12 (inclusively) as 0 and recode 2 or any value greater than or equal to 13 as 1. Because it involves many variables I am making use of "mutate_at" to automate the recoding. Also, the numbers to take on the same code are not consecutive (e.g. 1 and 3-12 to be recoded as 0) so I used a nested "ifelse" statement. I tried the following
df1 <- df %>%
  mutate_at(vars(ends_with("x1")), factor,
            ifelse(x1 >= 3 & x1 <= 12, 0, ifelse(x1 == 1, 0,
                   ifelse(x1 == 2, 1, 0))))
However, this fails to work because R cannot recognize "x1". Any help on this is greatly appreciated in advance. The expected output would look like
> df1
  id ax1 bx1
1  1   0   1
2  2   0   1
3  3   0   0
4  4   1   1
5  5   1   1
Using ifelse, we can proceed as follows:
df %>%
  mutate_at(vars(ends_with("x1")), ~ifelse(. == 1 | . %in% 3:12, 0,
                                           ifelse(. == 2 | . >= 13, 1, .)))
  id ax1 bx1
1  1   0   1
2  2   0   1
3  3   0   0
4  4   1   1
5  5   1   1
We can use case_when
library(dplyr)
df %>%
  mutate_at(vars(ends_with("x1")), ~case_when((. >= 3 & . <= 12) | . == 1 ~ 0,
                                              . >= 13 | . == 2 ~ 1))
#  id ax1 bx1
#1  1   0   1
#2  2   0   1
#3  3   0   0
#4  4   1   1
#5  5   1   1
Here is another solution, similar to what you were attempting. I just added the "or" operator (|) to make a simpler ifelse and removed the factor part from your code.
library(dplyr)
df1 <- df %>%
  mutate_at(vars(ends_with("x1")), function(x)
    ifelse(x >= 3 & x <= 12 | x == 1, 0,
           ifelse(x >= 13 | x == 2, 1, 0)))
#  id ax1 bx1
#1  1   0   1
#2  2   0   1
#3  3   0   0
#4  4   1   1
#5  5   1   1
If there are no other possible conditions apart from the ones you mention (for example, having zeros), I think you could simplify it more by just reducing it to the following:
df1 <- df %>%
  mutate_at(vars(ends_with("x1")), function(x)
    ifelse(x >= 3 & x <= 12 | x == 1, 0, 1))
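As a side note, mutate_at is superseded in current dplyr; the same recode can be sketched with across plus case_when (assuming, as the question states, that only the values described occur):

```r
library(dplyr)

df <- data.frame(id = 1:5,
                 ax1 = c(1, 6, 8, 15, 17),
                 bx1 = c(2, 16, 8, 15, 17))

df1 <- df %>%
  mutate(across(ends_with("x1"),
                ~ case_when(.x == 1 | between(.x, 3, 12) ~ 0,
                            .x == 2 | .x >= 13           ~ 1)))
df1
#  id ax1 bx1
#1  1   0   1
#2  2   0   1
#3  3   0   0
#4  4   1   1
#5  5   1   1
```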

How to compare the count of unique values

I need to check whether the number of elements of each unique value in the variable PPT in A is equal to the number of elements of each unique value in PPT in B, and whether there is any value unique only to A or only to B.
For example:
PPTa <- c("ppt0100109","ppt0301104","ppt0100109","ppt0100109","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0504409","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNa <- c(110,54,110,110,49,10,49,110,409,40,10,10,110)
LLa <- c(150,55,150,150,45,15,45,115,405,45,5,15,50)
A <-data.frame(PPTa,CNa,LLa)
PPTb <- c("ppt0100200","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0764091","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNb <- c(110,54,110,110,49,10,49,110,409,40)
LLb <- c(150,55,150,150,45,15,45,115,405,45)
B <-data.frame(PPTb,CNb,LLb)
In this case, we have these unique values which occur a certain amount of times:
A$PPTa       TIMES
"ppt0100109"     6
"ppt0301104"     1
"ppt0300249"     2
"ppt0504409"     1
"ppt2303401"     1
"ppt0704210"     2
B$PPTb       TIMES
"ppt0100200"     1
"ppt0300249"     2
"ppt0100109"     3
"ppt0764091"     1
"ppt2303401"     1
"ppt0704210"     2
I would like to create a new matrix (or anything you could suggest) with a value of 0 if the unique value exists both in A and B with the same number of elements, a value of 1 if it exists in both dataframes A and B but the number of elements differ, and a value of 2 if the value exists only in one of the two dataframes.
Something like:
A$PPTa       TIMES OUTPUT
"ppt0100109"     6      1
"ppt0301104"     1      2
"ppt0300249"     2      0
"ppt0504409"     1      2
"ppt2303401"     1      0
"ppt0704210"     2      0
B$PPTb       TIMES OUTPUT
"ppt0100200"     1      2
"ppt0300249"     2      0
"ppt0100109"     3      1
"ppt0764091"     1      2
"ppt2303401"     1      0
"ppt0704210"     2      0
You can use a nested ifelse statement on the frequency tables of the two PPT columns:
ta <- as.data.frame(table(A$PPTa))
tb <- as.data.frame(table(B$PPTb))
ifelse(do.call(paste0, ta) %in% do.call(paste0, tb), 0, ifelse(ta$Var1 %in% tb$Var1, 1, 2))
#[1] 1 0 2 2 0 0
ifelse(do.call(paste0, tb) %in% do.call(paste0, ta), 0, ifelse(tb$Var1 %in% ta$Var1, 1, 2))
#[1] 1 2 0 0 2 0
Pasting each value together with its count matches across the two tables only when both agree (0); otherwise the inner ifelse distinguishes values present in both tables with differing counts (1) from values present in only one data frame (2). Note that the tables are sorted alphabetically by value, not by order of appearance.
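An alternative that keeps the values and counts visible (my own sketch, not part of the answer above) merges the two frequency tables with a full outer join and compares counts:

```r
PPTa <- c("ppt0100109","ppt0301104","ppt0100109","ppt0100109","ppt0300249",
          "ppt0100109","ppt0300249","ppt0100109","ppt0504409","ppt2303401",
          "ppt0704210","ppt0704210","ppt0100109")
PPTb <- c("ppt0100200","ppt0300249","ppt0100109","ppt0300249","ppt0100109",
          "ppt0764091","ppt2303401","ppt0704210","ppt0704210","ppt0100109")

ta <- as.data.frame(table(PPT = PPTa), stringsAsFactors = FALSE)
tb <- as.data.frame(table(PPT = PPTb), stringsAsFactors = FALSE)

# all = TRUE keeps values unique to either side; their missing count is NA.
m <- merge(ta, tb, by = "PPT", all = TRUE, suffixes = c(".a", ".b"))
m$OUTPUT <- ifelse(is.na(m$Freq.a) | is.na(m$Freq.b), 2,
                   ifelse(m$Freq.a == m$Freq.b, 0, 1))
m
```

Here m lists every unique value once (alphabetically), with both counts side by side and the requested 0/1/2 code in OUTPUT.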

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B))
#   A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
We could use fill from tidyr after changing the 'B' values to NA that corresponds to 0 in 'A'
library(dplyr)
library(tidyr)
dat %>%
  mutate(B = NA^(!A) * B) %>%
  fill(B)
#  A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
Here's a solution using grouping, and rleid (run-length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, and rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid where A == 1. Then we group and take the first B value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: always carry forward B when A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
      A     B   grp
  <dbl> <dbl> <dbl>
1     1     0     2
2     0     0     2
3     0     0     2
4     0     0     2
5     1     1     3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, the condition is switched: it should be "if all A != 1", not "if not all A == 1".)
set.seed(30)
dat <- data.frame(A = sample(0:1, 15, replace = TRUE),
                  B = sample(0:1, 15, replace = TRUE))
> dat
   A B
1  0 1
2  0 0
3  0 1
4  0 1
5  0 0
6  0 0
7  1 1
8  0 0
9  1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
       A     B   grp
   <int> <int> <dbl>
1      0     1     1
2      0     1     1
3      0     1     1
4      0     1     1
5      0     1     1
6      0     1     1
7      1     1     3
8      0     1     3
9      1     0     5
10     0     0     5
11     0     0     5
12     0     0     5
13     1     0     6
14     1     1     7
15     0     1     7
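For comparison, the row-by-row propagation the OP describes can also be written as a plain base R loop, which makes the sequential semantics explicit (a sketch equivalent to the answers above for this input):

```r
dat <- data.frame(A = c(1, 0, 0, 0, 1), B = c(0, 1, 1, 1, 1))

b <- dat$B
for (i in seq_along(b)[-1]) {
  # Where A == 0, copy B from the previous row *after* earlier updates,
  # so changes propagate downwards.
  if (dat$A[i] == 0) b[i] <- b[i - 1]
}
dat$B <- b
dat$B
#[1] 0 0 0 0 1
```

One edge case: if A starts with 0 there is no previous row to copy from; this loop leaves that B unchanged, whereas zoo::na.locf would need na.rm = FALSE to avoid dropping the leading NA.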
