Passing the list of strings as input to a function - r

I am trying to automate a simple task in R using a function.
c is a character vector of column names, and mydata is the dataset.
Basically, I need to give each of the strings in vector c as an input to the function.
dataset:
mydata <- structure(list(a = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L), b = c(4L,3L, 1L, 2L, 1L, 5L, 2L, 2L), c = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,1L), d = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), t = c(42L, 34L, 74L,39L, 47L, 8L, 36L, 39L), s = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), .Names = c("a", "b", "c", "d", "t", "s"), row.names = c(NA,8L), class = "data.frame")
code:
c<-c("a","b","c","d")
plot<-function()
for (i in c)
{
fit<-survfit(Surv(s,t)~paste(i), dat=mydata)
ggsurvplot(fit, pval = TRUE)
}
plot()
I am facing the following error:
Error in model.frame.default(formula = Surv(mydata$s, mydata$t) ~
paste(i), : variable lengths differ (found for 'paste(i)')
I have tried the reformulate as well:
plot<-function()
for (i in c)
{
survfit(update(Surv(s,t)~., reformulate(i)), data=mydata)
ggsurvplot(fit, pval = TRUE)
}
plot()
but this code also gives this error:
Error in reformulate(i) : object 'i' not found
Any help to make this code work?
Thanks

Building formulas dynamically can be tricky. Rather than
survfit(Surv(mydata$s,mydata$t)~paste(i), dat=mydata)
use
survfit(update(Surv(t,s)~., reformulate(i)), data=mydata)
You should avoid using $ with formulas. Here reformulate() builds a formula from a string, and update() combines parts of formulas. See the help pages for these functions if you would like more details.
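As a rough illustration of what these two helpers return, the sketch below builds the formula for a single column name without fitting anything. It uses only base R; the column name "a" and the response Surv(t, s) come from the sample data, and Surv is never evaluated at this stage, so the survival package is not needed just to build the formula. The as.formula(paste(...)) line is an alternative added here for comparison, not part of the original suggestion.

```r
# Build a one-sided formula from a column name held in a string
i <- "a"
rhs <- reformulate(i)                 # ~a

# Replace the "." placeholder with that right-hand side
f <- update(Surv(t, s) ~ ., rhs)      # Surv(t, s) ~ a

# Alternative: paste the pieces together and convert to a formula
g <- as.formula(paste("Surv(t, s) ~", i))

print(f)
print(g)
```

Either formula object can then be passed to survfit() together with data = mydata.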
Here's the full working version with the sample input
#sample input
mydata <- structure(list(a = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L), b = c(4L,3L, 1L, 2L, 1L, 5L, 2L, 2L), c = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,1L), d = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), t = c(42L, 34L, 74L,39L, 47L, 8L, 36L, 39L), s = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), .Names = c("a", "b", "c", "d", "t", "s"), row.names = c(NA,8L), class = "data.frame")
c<-c("a","b","c","d")
and the code
library(survival)
library(survminer)
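plot <- function() {
  for (i in c) {
    # build e.g. Surv(t, s) ~ a from the column name stored in i
    fit <- survfit(update(Surv(t,s)~., reformulate(i)), data=mydata)
    # inside a loop the plot has to be printed explicitly to be displayed
    print(ggsurvplot(fit))
  }
}
plot()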
When I copy/paste that into R I do not get any errors. You must be doing something different than the sample code you've posted.

Related

R code of scatter plot for four variables

I tried plotting ASB vs YOI for each Child grouped by Race
I got something like:
library(tidyverse)
Antisocial <- structure(list(Child = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L), ASB = c(1L, 1L, 1L, 0L, 0L, 0L, 5L, 5L, 5L, 2L), Race = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Y92 = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Y94 = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), YOI = c(90L, 92L, 94L, 90L, 92L, 94L, 90L, 92L, 94L, 90L)), row.names = c(NA, 10L), class = "data.frame")
ggplot(data = Antisocial, aes(x = YOI, y = ASB)) +
geom_point( colour = "Black", size = 2) +
geom_line(data = Antisocial, aes(x= Child), size = 1) +
facet_grid(.~ Race)
Plot Image I generated: https://drive.google.com/file/d/1sZVsRFiGC0dIGg0GWhHhNDCaiW2iB-ky/view?usp=sharing
Full dataset- https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
I want to use 2 charts side by side Race=0, Race= 1 to plot ASB vs YOI for each Child grouped by Race. The line, however, should only connect to dots of the same child. As it is right now, all the dots are connected. Furthermore the scale of YOI should be (90,94).
Can you suggest what I should change?
Thanks!
Thanks for providing the data. I changed 4 observations to race 0 to have some variation:
library(tidyverse)
Antisocial <- structure(list(Child = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L), ASB = c(1L, 1L, 1L, 0L, 0L, 0L, 5L, 5L, 5L, 2L), Race = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L), Y92 = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Y94 = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), YOI = c(90L, 92L, 94L, 90L, 92L, 94L, 90L, 92L, 94L, 90L)), row.names = c(NA, 10L), class = "data.frame")
ggplot(data = Antisocial, aes(x = YOI, y = ASB, group = Child)) +
geom_point( colour = "Black", size = 2) +
geom_line()+
facet_grid(.~ Race)
To connect the dots for each child, you need to include group = Child in the code. I think this is what you want? Let me know if this solved your problem :)

create a dataframe for multiple line plot for ggplot R

This question is about arranging data for a ggplot line plot. I have been doing this manually in Excel and I want to work out a way to do it in R.
I have reviewed this post which is similar
Arrange dataframe format for ggplot - R
I have a dataset that looks like this:
I want to convert it to a dataframe that is divided into the groups (N,A,G) and into age brackets and the proportion per age_group.
An example of what I am trying to achieve:
Appreciate your help.
Data:
structure(list(ID = 1:10, Age = c(9L, 16L, 12L, 13L, 29L, 24L,
23L, 24L, 16L, 40L), Sex = structure(c(1L, 1L, 2L, 1L, 1L, 2L,
2L, 1L, 1L, 1L), .Label = c("F", "M"), class = "factor"), Age_group =
c(1L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 4L), N = c(1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 0L, 0L), A = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L,
0L), G = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))
We can pivot to 'long' format with pivot_longer, create a grouping variable with cut on 'Age', sum 'n' for each age bracket and group, and then compute the 'proportion' within each age bracket (here df1 is the data frame shown above)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = N:G, names_to = 'group', values_to = 'n') %>%
group_by(Age_group_new = cut(Age, breaks = c(-Inf, 0, seq(10, 70, by = 10), 100, Inf)), group) %>%
summarise(n = sum(n)) %>%
group_by(Age_group_new) %>%
mutate(proportion = n/sum(n),
proportion = replace(proportion, is.nan(proportion), 0))
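To see what cut() does in the pipeline above, here is a small base R sketch that applies the same breaks to a few ages from the sample data. Intervals are right-closed by default, so an age of exactly 10 would fall into (0,10]. The object names ages, breaks and brackets are only for illustration.

```r
ages <- c(9, 16, 29, 40)
breaks <- c(-Inf, 0, seq(10, 70, by = 10), 100, Inf)

# Each age is mapped to the interval (lower, upper] that contains it
brackets <- cut(ages, breaks = breaks)
print(data.frame(age = ages, bracket = brackets))
```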

How to write a function to count the number of observations based on specific conditions in R?

I have a data frame of 1401 observations of 16 variables. For each column (except the first one), I have either 1 (if a condition is met) or 0 (if a condition is not met). Overall, the idea is to count how many observations meet certain conditions successively. We can think about it as a decision tree: in the first branch you can have either 1 (condition is met) or 0 (condition is not met), in the second branch starting from the 0 of the first branch, you can also have 1 or 0, etc... In my data frame, branches are columns. I want to investigate the impact of looking at the different branches (columns) in various orders.
My idea is to count the number of "1" in column Cn if I know that there was a "0" in column Cn-1.
dput(droplevels(head(data,20)))
structure(list(Substance = structure(c(13L, 9L, 10L, 12L, 1L,
19L, 16L, 17L, 5L, 2L, 14L, 7L, 4L, 6L, 20L, 18L, 15L, 3L, 11L,
8L), .Label = c("104653-34-1", "107-02-8", "111-30-8", "12057-74-8",
"122454-29-9", "14915-37-8", "20859-73-8", "27083-27-8", "28772-56-7",
"3691-35-8", "55965-84-9", "56073-07-5", "56073-10-0", "5836-29-3",
"71751-41-2", "74-90-8", "81-81-2", "86347-14-0", "90035-08-8",
"91465-08-6"), class = "factor"), colA = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
colB = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), colC = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), colD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L), colE = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L,
1L, 1L), colF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L), colG = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L), colH = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), colI = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L
), colK = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), colJ = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 0L), colL = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L,
0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), colM = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), colN = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), colO = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("Substance",
"Oral", "Dermal", "Inhalation", "SC", "SED", "RS", "SS", "M",
"C", "R", "STOT.SE", "STOT.RE", "AT", "Eco.Acute", "Eco.Chronic"
), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 12L, 13L,
14L, 17L, 18L, 19L, 20L, 21L, 22L, 28L, 34L), class = "data.frame")
#I define the order in which I look at the columns
orderA <- colnames(data)[2:16]
#no-yes function counts chemicals which meet condition Cn when condition Cn-1 is not met
count_no_yes <- function(data, cols) {
data <- data[, cols]
sum(apply(data, 1, function(x) all(x == 1)))
}
endpoints <- 0:15
#scenario A with order A of the columns
counts <- sapply(1:15, function(i) count_no_yes(data, orderA[1:i]))
counts <- c(nrow(data), counts)
scenarioA <- data.frame(endpoint=endpoints, hits=counts, scenario="scenarioA")
My problem is that I don't know how to include the information from the previous observation in my code. The current code is not working. I get the following error: Error in apply(data, 1, function(x) all(x == 1)): dim(X) must have a positive length.
The idea is then to plot the number of observations that meet the conditions for each branch of the tree (column).
#scenario B with a different order of the columns
orderB <- colnames(data)[c(9, 10, 11, 5, 6, 8, 3, 2, 4, 13, 12, 7, 14, 15, 16)]
counts <- sapply(1:15, function(i) count_no_yes(data, orderB[1:i]))
counts <- c(nrow(data), counts)
scenarioB <- data.frame(endpoint=endpoints, hits=counts, scenario="scenarioB")
#combine the different scenarios and plot
scenarios <- rbind(scenarioA, scenarioB)
library(ggplot2)
ggplot(scenarios, aes(x=endpoint, y=hits, color=scenario, group=scenario)) +
geom_point() +
geom_line()
Could it be this?
We tidy the data with tidyr::gather, then dplyr::group_by(par), and count the number of times a 0 is followed by a 1.
library(dplyr)
my.fun <- function(x) {
#Values
v <-rle(x)[[2]]
#Consecutive length
l <- rle(x)[[1]]
tmp <- data.frame(v = v, l=l)
tmp <-
tmp %>%
# for each column find a substance with
# 1 which came after a substance with value 0
# and check that 1 is followed by a zero
mutate(flag = ifelse(v==1 & lag(v)==0 & lead(v) == 0, 1, 0))
#return the sum of the `flag`value
sum(tmp$flag, na.rm = TRUE)
}
df %>%
tidyr::gather("par", "value", everything(), -Substance) %>%
group_by(par) %>%
summarise(c = my.fun(value))
# A tibble: 15 x 2
par c
<chr> <dbl>
1 AT 0
2 C 0
3 Dermal 0
4 Eco.Acute 1
5 Eco.Chronic 0
6 Inhalation 0
7 M 0
8 Oral 0
9 R 4
10 RS 1
11 SC 2
12 SED 1
13 SS 0
14 STOT.RE 4
15 STOT.SE 3
The rle function is a real gem for analyzing consecutive runs in a vector.
my.fun can probably be adjusted to your exact needs.
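On the error quoted in the question: data[, cols] returns a plain vector when cols names a single column, and apply() needs an object with dimensions; adding drop = FALSE keeps it a data frame. Separately, the following is a minimal sketch of one possible reading of the goal, namely counting, for each column in a chosen order, the rows that have a 1 there and only 0s in all earlier columns. The toy data and the name count_first_hits are hypothetical, and NA values are treated as "condition not met", which is an assumption.

```r
# Hypothetical toy data: one row per substance, 0/1 flag per condition
toy <- data.frame(
  Substance = c("s1", "s2", "s3", "s4", "s5"),
  A = c(1L, 0L, 0L, 0L, 1L),
  B = c(1L, 1L, 0L, 0L, 0L),
  C = c(0L, 1L, 1L, 0L, 1L)
)

# Selecting one column drops the dimensions unless drop = FALSE is used
is.null(dim(toy[, "A"]))          # TRUE: a plain vector
dim(toy[, "A", drop = FALSE])     # 5 1: still a data frame

# For each column (in the given order), count rows with a 1 in that column
# and no 1 in any earlier column
count_first_hits <- function(data, cols) {
  remaining <- rep(TRUE, nrow(data))    # rows with no 1 so far
  hits <- integer(length(cols))
  for (i in seq_along(cols)) {
    is_one <- !is.na(data[[cols[i]]]) & data[[cols[i]]] == 1
    hits[i] <- sum(remaining & is_one)
    remaining <- remaining & !is_one
  }
  hits
}

hits_abc <- count_first_hits(toy, c("A", "B", "C"))
hits_cba <- count_first_hits(toy, c("C", "B", "A"))
```

Calling the function with different column orders gives one vector of counts per scenario, which could then be combined into a data frame for plotting as in the question.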

How to mutate a column using dplyr with a value when any of the columns contain a 1 otherwise 0

events <- structure(list(ID = c(3049951, 3085397, 3204081, 3262134,
3467254), TVTProcedureStartDate = structure(c(16210, 16238, 16322,
16420, 16546), class = "Date"), DCDate = structure(c(16213, 16250,
16326, 16426, 16560), class = "Date"), CE_EventOccurred = c(0L,
0L, 0L, 0L, 0L), CE_EventDate = c(0L, 0L, 0L, 0L, 0L), `Annular Dissection (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Aortic Dissection (In Hospital)` = c(0L, 0L,
0L, 1L, 0L), `Atrial Fibrillation (In Hospital)` = c(0L, 1L,
0L, 0L, 1L), `Bleeding at Access Site (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Cardiac Arrest (In Hospital)` = c(1L, 0L, 0L,
0L, 0L), `Conduction/Native Pacer Disturbance Req ICD (In Hospital)` = c(0L,
0L, 1L, 0L, 0L), `Conduction/Native Pacer Disturbance Req Pacer (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Endocarditis (In Hospital)` = c(0L, 0L, 0L,
0L, 0L), `GI Bleed (In Hospital)` = c(0L, 0L, 0L, 0L, 0L), `Hematoma at Access Site (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Ischemic Stroke (In Hospital)` = c(0L, 0L,
0L, 0L, 0L), `Major Vascular Complications (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Minor Vascular Complication (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Mitral Leaflet Injury - detected during surgery (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Mitral Subvalvular Injury -detected during surgery (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `New Requirement for Dialysis (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Other Bleed (In Hospital)` = c(0L, 0L, 0L,
0L, 0L), `Perforation with or w/o Tamponade (In Hospital)` = c(1L,
0L, 0L, 0L, 0L), `Retroperitoneal Bleeding (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Single Leaflet Device Attachment (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Unplanned Other Cardiac Surgery or Intervention (In Hospital)` = c(0L,
0L, 0L, 0L, 0L), `Unplanned Vascular Surgery or Intervention (In Hospital)` = c(0L,
0L, 0L, 1L, 0L)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), vars = "NCDRPatientID", labels = structure(list(
NCDRPatientID = c(3049951, 3085397, 3204081, 3262134, 3467254
)), class = "data.frame", row.names = c(NA, -5L), vars = "NCDRPatientID", labels = structure(list(
NCDRPatientID = c(3049951, 3085397, 3204081, 3262134, 3467254,
3467324, 3510387, 3586037, 3661089, 3668621, 3679485, 3737916,
3738064, 3960141, 4006862, 4018241, 4019056, 4025174, 4027490,
4050900, 4051101, 4096816, 4097119, 4097146, 4097180, 4098426,
4106410, 4109968, 4147466, 4198427, 4198450, 4198458, 4204554,
4208053, 4213116, 4218802, 4218854, 4223378, 4223415, 4243959,
4316979, 4341660, 4348676, 4413567, 4419513, 4421948, 4422768,
4426483, 4430159, 4431211, 4433156, 4433406, 4433988)), class = "data.frame", row.names = c(NA,
-53L), vars = "NCDRPatientID", labels = structure(list(NCDRPatientID = c(3049951,
3085397, 3204081, 3262134, 3467254, 3467324, 3510387, 3586037,
3661089, 3668621, 3679485, 3737916, 3738064, 3960141, 4006862,
4018241, 4019056, 4025174, 4027490, 4050900, 4051101, 4096816,
4097119, 4097146, 4097180, 4098426, 4106410, 4109968, 4147466,
4198427, 4198450, 4198458, 4204554, 4208053, 4213116, 4218802,
4218854, 4223378, 4223415, 4243959, 4316979, 4341660, 4348676,
4413567, 4419513, 4421948, 4422768, 4426483, 4430159, 4431211,
4433156, 4433406, 4433988)), class = "data.frame", row.names = c(NA,
-53L), vars = "NCDRPatientID", drop = TRUE), indices = list(0L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10:12, 13L, 14L, 15L,
16:17, 18L, 19:21, 22L, 23L, 24L, 25:26, 27L, 28L, 29:30,
31L, 32:33, 34L, 35:38, 39L, 40:41, 42L, 43L, 44L, 45L, 46L,
47L, 48:50, 51:53, 54L, 55L, 56L, 57L, 58L, 59:60, 61L, 62L,
63:64, 65:66, 67:68, 69L, 70L, 71:72, 73L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 4L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 2L, 1L), biggest_group_size = 4L), indices = list(0L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L,
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L,
51L, 52L), drop = TRUE, group_sizes = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), biggest_group_size = 1L), indices = list(0L, 1L, 2L, 3L, 4L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L), biggest_group_size = 1L)
From this data, I need to create a column that has the value 1 if any of the columns whose names end in (In Hospital) contains 1, and 0 otherwise.
I tried multiple things, but each either doesn't work or throws an error:
Error in mutate_impl(.data, dots) : Evaluation error: NA/NaN argument.
event %>% mutate(TR = rowSums(select_(.,6:n)))
Error in mutate_impl(.data, dots) : Column `TR` must be length 1 (the group size), not 53
event %>% mutate(TR = rowSums(.[6:ncol(.)]))
I tried some other variations to see if I could make sense of it, but I keep running into similar errors and problems.
Another thing I tried was the following, which seems to do the row sums, but it also adds the ID:
event %>% select(6:27) %>% rowSums()
It added the ID to the 1s and 0s from columns 6 to 27 for each row. I'm not sure why it's doing this.
I want the result to be a data frame with the same data plus a column containing 1 if any of columns 6 to 27 contains 1, and 0 otherwise.
Before I developed my solution, I ran the following code to ungroup your data.
library(dplyr)
events <- events %>% ungroup()
Solution 1: rowSums with selected columns
The idea of this solution is to use rowSums to add all the numbers from the selected columns, determine if the sum is larger than 0, and then convert the logical vector to an integer vector (with 1 or 0).
There are many ways to select the columns. We can select based on column numbers.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., 6:27)) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use ends_with.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., ends_with("(In Hospital)"))) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use matches. The regular expression \\(In Hospital\\)$ matches the literal string (In Hospital) at the end of the column name.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., matches("\\(In Hospital\\)$"))) > 0))
events2$Col
# [1] 1 1 1 1 1
We can use contains, but notice that the target string then does not need to be at the end of the column names.
events2 <- events %>% mutate(Col = as.integer(rowSums(select(., contains("(In Hospital)"))) > 0))
events2$Col
# [1] 1 1 1 1 1
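Since every row of the sample data contains at least one 1, the outputs above never show the 0 case. The sketch below applies the same rowSums idea in base R to a small hypothetical data frame that includes rows without any in-hospital event; the column names and the new column any_in_hospital are made up for illustration.

```r
# Hypothetical toy data; check.names = FALSE keeps the spaces and parentheses
toy <- data.frame(
  ID = 1:3,
  `Stroke (In Hospital)` = c(0L, 1L, 0L),
  `GI Bleed (In Hospital)` = c(0L, 0L, 0L),
  `Stroke (30 Day)` = c(1L, 0L, 0L),
  check.names = FALSE
)

# Logical index of the columns whose names end in "(In Hospital)"
target <- endsWith(names(toy), "(In Hospital)")

# Row-wise sum of the selected columns, then TRUE/FALSE -> 1/0
toy$any_in_hospital <- as.integer(rowSums(toy[, target, drop = FALSE]) > 0)
print(toy$any_in_hospital)
```

Only the second row gets a 1 here, because the 1 in the first row sits in a column that does not end in "(In Hospital)".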
Solution 2: apply with max
Since the values in the target columns are all 1 or 0, we can use apply with max to get the row maximum, which is 1 if there are any 1s and 0 otherwise. All the ways of using select shown above also work here. Below is one way to do this.
events2 <- events %>% mutate(Col = apply(select(., ends_with("(In Hospital)")), 1, max))
events2$Col
# [1] 1 1 1 1 1
It is not a dplyr way, but it also works:
events$new_col <- 0
events$new_col[rowSums(events[, grep("In Hospital", colnames(events))]) >= 1] <- 1
A solution from base R using apply()
cols <- grep("in hospital", colnames(events), ignore.case = TRUE)
apply(events[, cols], 1, function(x) ifelse(any(x == 1), 1, 0))
# [1] 1 1 1 1 1

R: Recoding multiple dummy variables into a single variable and replacing the corresponding dummy value with the variable name

I have a dataset with 14 mutually exclusive categories of call type all coded as dummy variables. Here is a small sample:
dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("MON1_12",
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS",
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "data.frame", row.names = c(NA,
-10L))
I want to combine the dummy variables into a single new variable called "QUEUE" that holds, for each row, the name of the dummy variable whose value is 1. Here is an example of what this would look like:
dput(df2)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), QUEUE = structure(c(1L, 4L, 2L, 4L, 1L, 3L,
3L, 5L, 5L, 4L), .Label = c("CLAIMS", "CONTENT", "CREDIT_CARD",
"DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12",
"WEEK1_53", "AGENT_ID", "CallsHandled", "QUEUE"), class = "data.frame", row.names = c(NA,
-10L))
Edit in response to having question marked down: This is what I had tried this afternoon on recommendation with a slightly different sample dataframe:
df$Queue <- as.factor(df$CONTENT + df$CLAIMS*2 + df$CREDIT_CARD*3 + df$DEDUCT_BILL*4 + df$HCREFORM*5)
levels(df$Queue) <- c("CONTENT", "CLAIMS", "CREDIT_CARD","DEDUCT_BILL","HCREFORM")
View(df)
But I received a column of NA's in the Queue column. So, I recreated another sample dataset here. This dataframe is adequately representative of what I'll receive in reality, except I'll have about 40 variables and 2 million rows. When I run what I tried above on "df" above I get the following incorrect result:
dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Queue = structure(c(2L,
1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("CONTENT",
"CLAIMS", "CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12",
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS",
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM", "Queue"), row.names = c(NA,
-10L), class = "data.frame")
I also tried:
df3 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))
but received the following error: "Error in data.frame("CLAIMS", character(0), character(0), "DEDUCT_BILL", : arguments imply differing number of rows: 1, 0"
You could use max.col to get the index of the column that has a value of 1 in each row, for columns 5 to 9. (The 'df' example is not correct, as most of its rows are all 0s. A corrected one is given below.)
df$QUEUE <- names(df)[-c(1:4)][max.col(df[-c(1:4)])]
Or you can do
df$QUEUE <- names(df)[-(1:4)][(as.matrix(df[-(1:4)]) %*%
seq_along(df[-(1:4)]))[,1]]
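One detail worth knowing about max.col: its default ties.method is "random", so a row in which every dummy is 0 gets an arbitrary column. The sketch below, on a small hypothetical data frame, uses ties.method = "first" for a deterministic result and then marks all-zero rows as NA; setting them to NA is an assumption about what is wanted, not something stated in the question.

```r
# Hypothetical dummy columns; the last row has no 1 at all
dummies <- data.frame(
  CONTENT  = c(0, 1, 0, 0),
  CLAIMS   = c(1, 0, 0, 0),
  HCREFORM = c(0, 0, 1, 0)
)

# Index of the (first) largest value in each row
idx <- max.col(dummies, ties.method = "first")
queue <- names(dummies)[idx]

# Rows without any 1 have no meaningful category
queue[rowSums(dummies) == 0] <- NA
print(queue)
```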
Update
Based on the edited dataset 'df', some rows are all 0s in columns 5:9, and the expected result shows 'QUEUE' as 'CONTENT' for those rows. In that case, we can first set 'CONTENT' to 1 in the rows that are all 0s and then apply either of the code snippets above
df$CONTENT[!rowSums(df[5:9])] <- 1
df$QUEUE1 <- names(df)[5:9][max.col(df[5:9])]
df$QUEUE1
#[1] "CLAIMS" "CONTENT" "CONTENT" "DEDUCT_BILL" "CONTENT"
#[6] "CONTENT" "CONTENT" "CONTENT" "CONTENT" "CONTENT"
data
df <- structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
AGENT_ID = structure(c(3L,
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360",
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L,
5L, 1L, 1L, 3L), CONTENT = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0), CLAIMS = c(1,
0, 0, 0, 1, 0, 0, 0, 0, 0), CREDIT_CARD = c(0, 0, 0, 0, 0, 1,
1, 0, 0, 0), DEDUCT_BILL = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 1),
HCREFORM = c(0,
0, 0, 0, 0, 0, 0, 1, 1, 0)), .Names = c("MON1_12", "WEEK1_53",
"AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", "CREDIT_CARD",
"DEDUCT_BILL", "HCREFORM"), row.names = c(NA, -10L), class = "data.frame")
This should produce the desired result:
df2 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))
provided that exactly one of the dummy variables is 1 in every row (which is not true in your original sample of df).
Explanation: df[1:4] selects columns one through four, which are preserved in the output. They are then column-bound to QUEUE using cbind. QUEUE is obtained by iterating row-wise over the dummy variables (columns five through nine) of df and selecting the name of the column that contains the value one.
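The "exactly one" condition matters because apply() can only simplify its result to a vector when every row returns a value of the same length. A rough sketch on hypothetical toy data shows what happens otherwise, along with one possible guard that returns NA for rows without exactly one 1 (the helper name pick_name is made up here):

```r
# Hypothetical toy data: the second row has no 1
m <- data.frame(A = c(1, 0), B = c(0, 0))

# Row 1 yields "A", row 2 yields character(0), so apply() returns a list
res <- apply(m, 1, function(N) names(N)[as.logical(N)])
is.list(res)

# Guarded version: always return exactly one string per row
pick_name <- function(N) {
  hit <- names(N)[N == 1]
  if (length(hit) == 1) hit else NA_character_
}
queue <- unname(apply(m, 1, pick_name))
print(queue)
```

A list with elements of different lengths is what cbind() then fails to turn into a single column, which is consistent with the "differing number of rows" error reported in the question.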
