Create presence/absence variables from character string for long data - r

Let's say I have a data frame like this:
dat<- data.frame(ID= rep(c("A","B","C","D"),4),
test= rep(c("pre","post"),8),
item= c(rep("item1",8),rep("item2",8)),
answer= c("undergraduateeducation_graduateorprofessionalschool_employment",
"graduateorprofessionalschool",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
"volunteeractivityoroutreach",
"undergraduateeducation_employment_volunteeractivityoroutreach",
"employment",
"volunteeractivityoroutreach",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
"undergraduateeducation_graduateorprofessionalschool_employment",
"graduateorprofessionalschool",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
"volunteeractivityoroutreach",
"undergraduateeducation_employment_volunteeractivityoroutreach",
"employment",
"volunteeractivityoroutreach",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach"))
The answer column represents a "select all that apply" answer type, where the underscore separates the selected answer options. For each ID, test and item, I would like to change this single variable into multiple presence/absence variables indicating whether each answer component occurs in the string: 1 indicates that the answer option was present in the respondent's answer and 0 indicates that it was absent. The variables undergraduate, graduate, employment and volunteer in res correspond to the following strings in answer, respectively: undergraduateeducation, graduateorprofessionalschool, employment, volunteeractivityoroutreach. White spaces were removed.
The result data frame would look as follows:
res<- data.frame(ID= rep(c("A","B","C","D"),4),
test= rep(c("pre","post"),8),
item= c(rep("item1",8),rep("item2",8)),
undergraduate= c(1,0,1,0,1,0,0,1,1,0,1,0,1,0,0,1),
graduate= c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1),
employment=c(1,0,1,0,1,1,0,1,1,0,1,0,1,1,0,1),
volunteer=c(0,0,1,1,1,0,1,1,0,0,1,1,1,0,1,1))

In base R you could do:
new_cols <- c('undergraduate', 'graduate', 'employment', 'volunteer')
cbind(dat[1:3],
as.data.frame(do.call(rbind, lapply(strsplit(dat$answer, "_"),
function(x) {
z <- sapply(new_cols, function(y) as.numeric(grepl(paste0("\\b", y), x)))
if(is.vector(z)) z else colSums(z)
}))))
#> ID test item undergraduate graduate employment volunteer
#> 1 A pre item1 1 1 1 0
#> 2 B post item1 0 1 0 0
#> 3 C pre item1 1 1 1 1
#> 4 D post item1 0 0 0 1
#> 5 A pre item1 1 0 1 1
#> 6 B post item1 0 0 1 0
#> 7 C pre item1 0 0 0 1
#> 8 D post item1 1 1 1 1
#> 9 A pre item2 1 1 1 0
#> 10 B post item2 0 1 0 0
#> 11 C pre item2 1 1 1 1
#> 12 D post item2 0 0 0 1
#> 13 A pre item2 1 0 1 1
#> 14 B post item2 0 0 1 0
#> 15 C pre item2 0 0 0 1
#> 16 D post item2 1 1 1 1
Created on 2022-05-05 by the reprex package (v2.0.1)

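One option is to use tidyverse to separate the data into rows on the _, then keep only the keywords (which will be used as column names). A row number (rn) is added first so that each original row stays distinct, since the same ID, test and item combination occurs more than once. We then create a value column to note presence, pivot to wide format, and fill the remaining cells with 0.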
library(tidyverse)
result <- dat %>%
mutate(rn = row_number()) %>%
separate_rows(answer, sep = "_") %>%
mutate(answer = str_extract(answer, "undergraduate|graduate|employment|volunteer"),
value = 1) %>%
pivot_wider(names_from = "answer", values_from = "value", values_fill = 0) %>%
select(-rn)
Output
ID test item undergraduate graduate employment volunteer
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A pre item1 1 1 1 0
2 B post item1 0 1 0 0
3 C pre item1 1 1 1 1
4 D post item1 0 0 0 1
5 A pre item1 1 0 1 1
6 B post item1 0 0 1 0
7 C pre item1 0 0 0 1
8 D post item1 1 1 1 1
9 A pre item2 1 1 1 0
10 B post item2 0 1 0 0
11 C pre item2 1 1 1 1
12 D post item2 0 0 0 1
13 A pre item2 1 0 1 1
14 B post item2 0 0 1 0
15 C pre item2 0 0 0 1
16 D post item2 1 1 1 1
Test
identical(result, as_tibble(res))
#[1] TRUE
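As a further sketch, the same idea can be written in base R by matching the full option strings exactly instead of using regular expressions. The three-element answer vector and the opts lookup below are made up for illustration; they only reuse the option strings from the question.

```r
# Hypothetical mini example in the same "_"-separated format as `answer`
answer <- c("undergraduateeducation_employment",
            "graduateorprofessionalschool",
            "volunteeractivityoroutreach_employment")

# Lookup (an assumption for this sketch): new column name -> full option string
opts <- c(undergraduate = "undergraduateeducation",
          graduate      = "graduateorprofessionalschool",
          employment    = "employment",
          volunteer     = "volunteeractivityoroutreach")

# Split each answer into its selected options
parts <- strsplit(answer, "_", fixed = TRUE)

# For each option, check whether it is among the split components of each row;
# sapply() returns a matrix with one column per option name
flags <- sapply(opts, function(opt) {
  as.numeric(vapply(parts, function(p) opt %in% p, logical(1)))
})
flags
```

Because the comparison is on whole components, "graduate" cannot accidentally match inside "undergraduateeducation". The matrix can be attached to the identifying columns with cbind(), as in the base R answer above.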

Related

extract duplicate row based on condition across column in R

I'm stuck trying to collapse duplicated rows based on a condition in R. I want to apply the same condition across a large number of columns. In the example below, for each ID that appears in several rows I want to keep a single row that has the value 0 in a column whenever any of its duplicated rows has 0 in that column.
here is the data frame:
ID A B C
1 001 1 1 1
2 002 0 1 0
3 002 1 0 0
4 003 0 1 1
5 003 1 0 1
6 003 0 0 1
I want get like this:
ID A B C
1 001 1 1 1
2 002 0 0 0
3 003 0 0 1
Any help would be much appreciated, thanks!
Please check this code
# A tibble: 6 × 4
ID A B C
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 0 1 0
3 2 1 0 0
4 3 0 1 1
5 3 1 0 1
6 3 0 0 1
code
library(dplyr)
library(tidyr) # for fill()
data2 <- data %>% group_by(ID) %>%
mutate(across(c('A','B','C'), ~ ifelse(.x==0, 0, NA), .names = 'x{col}')) %>%
fill(xA, xB, xC) %>%
mutate(across(c('xA','xB','xC'), ~ ifelse(is.na(.x), 1, .x))) %>%
ungroup() %>% group_by(ID) %>% slice_tail(n=1)
output
# A tibble: 3 × 7
# Groups: ID [3]
ID A B C xA xB xC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 2 1 0 0 0 0 0
3 3 0 0 1 0 0 1
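If the columns only ever contain 0 and 1 (an assumption taken from the example, not stated in the question), "0 if any duplicated row has 0" is the same as the group-wise minimum. A minimal base R sketch under that assumption:

```r
# Example data from the question (ID stored as a number, as in the tibble above)
data <- data.frame(ID = c(1, 2, 2, 3, 3, 3),
                   A  = c(1, 0, 1, 0, 1, 0),
                   B  = c(1, 1, 0, 1, 0, 0),
                   C  = c(1, 0, 0, 1, 1, 1))

# One row per ID; each column becomes the minimum over that ID's rows,
# which is 0 as soon as any of the rows holds a 0
collapsed <- aggregate(cbind(A, B, C) ~ ID, data = data, FUN = min)
collapsed
```

With many columns, the formula `. ~ ID` could be used instead of listing them, provided every other column should be collapsed the same way.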

Make a new column for every variable and tally [duplicate]

I have the following dataframe:
sample name
1 a cobra, tiger, reptile
2 b tiger, spynx
3 c reptile, cobra
4 d sphynx, tiger
5 e cat, dog, tiger
6 f dog, spynx
and what I want to make from that is.
sample cobra tiger spynx reptile cat dog
1 a 1 1 0 1 0 0
2 b 0 1 1 0 0 0
3 c 1 0 0 1 0 0
4 d 0 1 1 0 0 0
5 e 0 1 0 0 1 1
6 f 0 0 1 0 0 1
So basically, I want to make a new column out of every value that occurs in the column name, and put a 1 if that value is present in df$name and a 0 if it is not.
all <- unique(unlist(strsplit(as.character(df$name), ", ")))
all <- all[!is.na(all)]
# create one column per unique value, initialised to 0
for(i in all){
df[[i]] <- 0 }
This gives me all the variables as 0's. Now I want to match them against the name column and turn the 0 into a 1 where the value is present.
How would you approach this?
With tidyr and dplyr...
library(tidyr)
library(dplyr, warn = FALSE)
df1 |>
separate_rows(name) |>
group_by(sample, name) |>
summarise(count = n(), .groups = "drop") |>
pivot_wider(names_from = "name", values_from = "count", values_fill = 0)
#> # A tibble: 6 × 8
#> sample cobra reptile tiger spynx sphynx cat dog
#> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 a 1 1 1 0 0 0 0
#> 2 b 0 0 1 1 0 0 0
#> 3 c 1 1 0 0 0 0 0
#> 4 d 0 0 1 0 1 0 0
#> 5 e 0 0 1 0 0 1 1
#> 6 f 0 0 0 1 0 0 1
Created on 2022-10-19 with reprex v2.0.2
data
df1 <- data.frame(sample = letters[1:6],
name = c("cobra, tiger, reptile",
"tiger, spynx",
"reptile, cobra",
"sphynx, tiger",
"cat, dog, tiger",
"dog, spynx"))
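For comparison, a base R sketch of the same split-and-indicate step, using the df1 defined above. It assumes the values are always separated by exactly ", " and, like the tidyr answer, treats "spynx" and "sphynx" as different values.

```r
df1 <- data.frame(sample = letters[1:6],
                  name = c("cobra, tiger, reptile",
                           "tiger, spynx",
                           "reptile, cobra",
                           "sphynx, tiger",
                           "cat, dog, tiger",
                           "dog, spynx"))

# Split each string into its values
parts <- strsplit(df1$name, ", ", fixed = TRUE)

# All distinct values, in order of first appearance
lev <- unique(unlist(parts))

# One row per sample, one 0/1 column per distinct value
ind <- t(vapply(parts, function(p) as.integer(lev %in% p), integer(length(lev))))
colnames(ind) <- lev

out <- cbind(df1["sample"], ind)
out
```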

Is there an R function for preparing datasets for survival analysis like stset in Stata?

Datasets look like this
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
As you can see, the rows for id = 1 are already in the form that coxph in the survival package expects. For id = 2, however, failure occurs at the beginning and at the end, but not in the middle.
Is there a general function to extract data from id = 2 and get the result like id = 1?
I think when id = 2, the result should look like below.
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
A bit hacky, but should get the job done.
Data:
# Load data
library(tidyverse)
df <- read_table("
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
")
Data wrangling:
# Check for sub-groups within IDs and remove all but the last one
df <- df %>%
# Group by ID
group_by(
id
) %>%
mutate(
# Check if a new sub-group is starting (after a failure)
new_group = case_when(
# First row is always group 0
row_number() == 1 ~ 0,
# If previous row was a failure, then a new sub-group starts here
lag(failure) == 1 ~ 1,
# Otherwise not
TRUE ~ 0
),
# Assign sub-group number by calculating cumulative sums
group = cumsum(new_group)
) %>%
# Keep only last sub-group for each ID
filter(
group == max(group)
) %>%
ungroup() %>%
# Remove working columns
select(
-new_group, -group
)
Result:
> df
# A tibble: 6 × 5
id start end failure x1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0
2 1 1 3 0 0
3 1 3 6 1 0
4 2 3 4 0 1
5 2 4 6 0 1
6 2 6 7 1 1
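The same "keep only the last sub-group after a failure" logic can also be sketched in base R with ave(). This is an alternative illustration of the approach above rather than an equivalent of stset; the names spell and last_spell are introduced only for this sketch.

```r
df <- data.frame(id      = c(1, 1, 1, 2, 2, 2, 2, 2),
                 start   = c(0, 1, 3, 0, 1, 3, 4, 6),
                 end     = c(1, 3, 6, 1, 3, 4, 6, 7),
                 failure = c(0, 0, 1, 1, 1, 0, 0, 1),
                 x1      = c(0, 0, 0, 1, 1, 1, 1, 1))

# Within each id, count how many failures occurred on earlier rows;
# the counter steps up on the row that follows a failure
spell <- ave(df$failure, df$id,
             FUN = function(f) cumsum(c(0, head(f, -1))))

# Highest counter value per id marks the last sub-group
last_spell <- ave(spell, df$id, FUN = max)

# Keep only the rows of the last sub-group for each id
res <- df[spell == last_spell, ]
res
```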

fill values between interval grouped by ID

I have a data set where subjects have a value of 1 or 0 at different times. I need a function or a piece of code that fills with 1 the values of 0 between the first and last 1.
I have tried complete() and fill(), but they are not doing what I want.
I have the following data:
dat = tibble(ID = c(1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3),
TIME = c(1,2,3,4,5,6,7,8,9,10,
1,2,3,4,5,6,7,8,9,10,
1,2,3,4,5,6,7,8,9,10),
DV = c(0,0,1,1,0,0,1,0,0,0,
0,1,0,0,0,0,0,0,0,1,
0,1,0,1,0,1,0,1,0,0))
# A tibble: 30 x 3
ID TIME DV
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 1
4 1 4 1
5 1 5 0
6 1 6 0
7 1 7 1
8 1 8 0
9 1 9 0
10 1 10 0
# ... with 20 more rows
I need the following output as shown in DV2:
dat = tibble(ID = c(1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3),
TIME = c(1,2,3,4,5,6,7,8,9,10,
1,2,3,4,5,6,7,8,9,10,
1,2,3,4,5,6,7,8,9,10),
DV = c(0,0,1,1,0,0,1,0,0,0,
0,1,0,0,0,0,0,0,0,1,
0,1,0,1,0,1,0,1,0,0),
DV2 = c(0,0,1,1,1,1,1,0,0,0,
0,1,1,1,1,1,1,1,1,1,
0,1,1,1,1,1,1,1,0,0))
# A tibble: 30 x 4
ID TIME DV DV2
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 2 0 0
3 1 3 1 1
4 1 4 1 1
5 1 5 0 1
6 1 6 0 1
7 1 7 1 1
8 1 8 0 0
9 1 9 0 0
10 1 10 0 0
# ... with 20 more rows
With dplyr, you can do:
dat %>%
rowid_to_column() %>%
group_by(ID) %>%
mutate(DV2 = if_else(rowid %in% min(rowid[DV == 1]):max(rowid[DV == 1]),
1, 0)) %>%
ungroup() %>%
select(-rowid)
ID TIME DV DV2
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 2 0 0
3 1 3 1 1
4 1 4 1 1
5 1 5 0 1
6 1 6 0 1
7 1 7 1 1
8 1 8 0 0
9 1 9 0 0
10 1 10 0 0
We can create a helper function, and apply it on every group, i.e.
f1 <- function(x) {
v1 <- which(x == 1)
x[v1[1]:v1[length(v1)]] <- 1
return(x)
}
with(dat, ave(DV, ID, FUN = f1))
#[1] 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0
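One case the example data does not cover is an ID with no 1 at all; there, indexing with the first and last position would fail because both are NA. A hedged variant of the helper that leaves such groups unchanged is sketched below. The name fill_between and the small two-group vectors are made up for illustration.

```r
# Fill with 1 everything between the first and last 1;
# return the input unchanged when it contains no 1
fill_between <- function(x) {
  idx <- which(x == 1)
  if (length(idx) > 0) {
    x[min(idx):max(idx)] <- 1
  }
  x
}

# Hypothetical data: group 1 has several 1s, group 2 has none
ID <- rep(1:2, each = 10)
DV <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

DV2 <- ave(DV, ID, FUN = fill_between)
DV2
```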

Making a conditional variable based on last observation in temporal data

ID T V1
1 1 1
1 2 1
2 1 0
2 2 0
3 1 1
3 2 1
3 3 1
I need to make two variables from these data. The first (v2) needs to be a 1 on the last observation only when V1 = 1, and the second (v3) needs to be a 1 on the last observation for all cases. Ideal final product:
ID T V1 v2 v3
1 1 1 0 0
1 2 1 1 1
2 1 0 0 0
2 2 0 0 1
3 1 1 0 0
3 2 1 0 0
3 3 1 1 1
Thanks in advance.
With the dplyr package, you can group your data by a variable (by ID in your case) and perform operations within each group. Since one of your columns (T) already gives the rank of each observation within its group, you can compare it with n(), which returns the number of rows in the current group, to obtain what you want.
Suppose your data are in the data frame df:
df %>%
group_by(ID) %>%
mutate(
v2 = 1 * (`T` == n()) * (V1 == 1),
v3 = 1 * (`T` == n())
)
# A tibble: 7 x 5
# Groups: ID [3]
ID T V1 v2 v3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 0
2 1 2 1 1 1
3 2 1 0 0 0
4 2 2 0 0 1
5 3 1 1 0 0
6 3 2 1 0 0
7 3 3 1 1 1
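If T could not be relied on to run from 1 to the group size, the last row of each ID can be located directly. The base R sketch below follows the definitions in the question (v2 is 1 on the last observation only when V1 = 1, v3 is 1 on the last observation for every ID) and assumes the rows are already sorted by time within each ID.

```r
df <- data.frame(ID = c(1, 1, 2, 2, 3, 3, 3),
                 T  = c(1, 2, 1, 2, 1, 2, 3),
                 V1 = c(1, 1, 0, 0, 1, 1, 1))

# TRUE on the last row of each ID (rows assumed sorted by time within ID)
is_last <- !duplicated(df$ID, fromLast = TRUE)

df$v2 <- as.numeric(is_last & df$V1 == 1)  # last observation and V1 == 1
df$v3 <- as.numeric(is_last)               # last observation, any V1
df
```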
