Related
I have data like this:
df<-structure(list(levels_incised___1 = c(0, 0, 0, 0, 0, 0), levels_incised___2 = c(1,
0, 0, 0, 0, 0), levels_incised___3 = c(1, 0, 0, 0, 0, 0), levels_incised___4 = c(1,
0, 0, 0, 0, 0), levels_incised___5 = c(1, 0, 0, 0, 0, 0), levels_incised___6 = c(1,
0, 0, 0, 0, 0), levels_incised___7 = c(1, 0, 0, 0, 0, 0), levels_incised___8 = c(1,
1, 1, 0, 0, 0), levels_incised___9 = c(1, 1, 1, 0, 0, 0), levels_incised___10 = c(1,
1, 1, 0, 0, 0), levels_incised___11 = c(0, 1, 0, 0, 0, 0), levels_incised___12 = c(0,
1, 0, 0, 0, 0), levels_incised___13 = c(0, 1, 0, 0, 0, 0), levels_incised___14 = c(0,
1, 0, 0, 0, 0), levels_incised___15 = c(0, 1, 0, 0, 0, 0), levels_incised___16 = c(0,
0, 0, 0, 0, 0), levels_incised___17 = c(0, 0, 0, 0, 0, 0), levels_incised___18 = c(0,
0, 0, 0, 0, 0), levels_incised___19 = c(0, 0, 0, 0, 0, 0), levels_incised___20 = c(0,
0, 0, 0, 0, 0), levels_incised___21 = c(0, 0, 0, 0, 0, 0), levels_incised___22 = c(0,
0, 0, 0, 1, 0), levels_incised___23 = c(0, 0, 0, 0, 1, 1), levels_incised___24 = c(0,
0, 0, 0, 1, 1), levels_incised___25 = c(0, 0, 0, 0, 1, 1), levels_incised___26 = c(0,
0, 0, 0, 1, 1), levels_incised___27 = c(0, 0, 0, 1, 1, 1), levels_incised___28 = c(0,
0, 0, 1, 1, 1), levels_incised___29 = c(0, 0, 0, 1, 1, 0), levels_incised___30 = c(0,
0, 0, 1, 1, 0), levels_incised___31 = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
That originally came from this Redcap input where each button was one of those columns:
And I need to create a column at the end (lets call it Level) with these possible inputs:
Cervical (any of the c buttons)
Thoracic (the t's)
Lumbar (the L's)
Sacral (sacral)
Thoracocervical (t's or c's)
Thoracolumbar (t's or l's)
Lumbosacral (l's and sacral)
So for instance, the patient in the first row had "1"'s in levels_incised_2 through levels_incised_10... meaning they had values in both the cervical range and the thoracic range. So that patient should get "Thoracocervical".
The patient in row 2 had 1's in 8 through 15, so they'd only get a "thoracic"
Does anyone know the most straight forward way to accomplish this?
Oh one last detail, there's 100+ other columns so it'd be nice if I could select/name these specific ones to count
A few things to resolve here:
find a way to convert levels...# to one of the C/T/... categories;
produce logic to infer based on presence of groups.
I think the first can be done by extracting the number and using findInterval to determine with of C/T/... each column belongs to. From there, we can do some simple c_across to find "any" in a group, and case_when to get your Level labels.
library(dplyr)
# helper function for renaming
func <- function(z) {
num <- as.integer(gsub("\\D", "", z))
grp <- c("C","T","L","S","Co","unclear")[findInterval(num, 1+c(0, 7, 19, 24, 29, 30, 31))]
grp <- paste0(grp, ave(grp, grp, FUN = seq_along))
# fix those that do not need numbering
grp[grepl("^Co", grp)] <- "Co"
grp[grepl("^unc", grp)] <- "unclear"
grp
}
out <- df %>%
rename_with(.cols = starts_with("levels"), .fn = func) %>%
rowwise() %>%
mutate(
anyC = sum(c_across(C1:C7)) > 0,
anyT = sum(c_across(T1:T12)) > 0,
anyL = sum(c_across(L1:L5)) > 0,
anyS = sum(c_across(S1:S5)) > 0
) %>%
ungroup() %>%
mutate(
Level = case_when(
anyC & anyT & anyL ~ "More than 2?",
anyL & anyS ~ "Lumbosacral",
anyT & anyL ~ "Thoracolumbar",
anyT & anyC ~ "Thoracocervical",
anyS ~ "Sacral",
anyL ~ "Lumbar",
anyT ~ "Thoracic",
anyC ~ "Cervical",
TRUE ~ "Nothing?"
)
)
out
# # A tibble: 6 x 36
# C1 C2 C3 C4 C5 C6 C7 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 L1 L2 L3 L4 L5 S1 S2 S3 S4 S5 Co unclear anyC anyT anyL anyS Level
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <chr>
# 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TRUE TRUE FALSE FALSE Thoracocervical
# 2 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 3 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 FALSE FALSE FALSE TRUE Sacral
# 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 FALSE FALSE TRUE TRUE Lumbosacral
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 FALSE FALSE TRUE TRUE Lumbosacral
out$Level
# [1] "Thoracocervical" "Thoracic" "Thoracic" "Sacral" "Lumbosacral" "Lumbosacral"
If you don't want to keep the renaming, then you can combine the Level result to your original frame with cbind(df, Level = out$Level).
using package dplyr:
## vertebra codes needed later on
vertebra_codes <- c(
paste0('C',1:7), paste0('T',1:12),
paste0('L',1:5), paste0('S',1:5),
'X', ## for Coccyx
'-' ## for unknown
)
df %>%
mutate(
## assuming each row is a case:
case_id = paste0('case_',row_number())
) %>%
## reshape the data from wide to long format:
pivot_longer(
cols = -case_id,
names_to = 'level_incised', values_to = 'is_incised'
) %>%
mutate(
## remove the redundant 'levels_incised__' prefix:
level_incised = gsub('.*_','',level_incised),
## assign the vertebra corresponding to 'level':
vertebra = vertebra_codes[as.integer(level_incised)],
## assign the spine region (e.g.: all lumbal vert. start with 'L'
spine_region = substr(vertebra,1,1)
) %>%
filter(is_incised == 1) %>% ## we're interested in incised vert. only
## remove replicates (more than one vertebra per spine region affected:
distinct(case_id, spine_region) %>%
## do the counts per case:
group_by(case_id) %>%
## string together the affected regions per case:
summarise(incised_regions = paste(spine_region, collapse = ','))
result:
# A tibble: 6 x 2
case_id incised_regions
<chr> <chr>
1 case_1 C,T
2 case_2 T
3 case_3 T
4 case_4 S,X
5 case_5 L,S,X
6 case_6 L,S
(Note that original `df` remains unchanged throughout the processing pipeline. However you can break up the pipeline by removing the `%>%` operator and inspect the intermediary steps, or assign them to temporary objects.)
extra / for fun: example code to ggplot the spine with vertebra status (incised or not) per patient.
This question already has answers here:
For each row return the column name of the largest value
(10 answers)
Closed 2 years ago.
Hello I am trying to create a new variable in my data set, that combines each dummy of "education" with their respective character strings so I can use the factor of edu in a regression model.
I am not certain how to create a new variable "edu" with "edu4"in the first & second row and so on...
Help is much appreciated!
As you not provide the dataset by dput function I built a small example by myself.
dput(df)
structure(list(id = 1:10, edu1 = c(1, 0, 0, 0, 0, 0, 0, 0, 1,
0), edu2 = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0), edu3 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), edu4 = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
edu5 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA,
-10L))
Solution
df$edu = factor(apply(df[,paste0("edu", 1:5)], 1, which.max))
Result
> df
id edu1 edu2 edu3 edu4 edu5 edu
1 1 1 0 0 0 0 1
2 2 0 0 0 1 0 4
3 3 0 0 0 1 0 4
4 4 0 0 0 0 1 5
5 5 0 0 0 1 0 4
6 6 0 1 0 0 0 2
7 7 0 0 0 0 1 5
8 8 0 1 0 0 0 2
9 9 1 0 0 0 0 1
10 10 0 0 0 0 1 5
Try this: df is your data frame, and your edu variables are in colum 7 to 12. But we start from 8. If all your edu variables are 0 edu1 will be generated.
factor_variable <- factor((df[ ,8:12] %*% (1:ncol(df[ ,8:12]))) + 1,
labels = c("edu1", colnames(df[ ,8:12])))
Let me know if this worked.
This question already has an answer here:
Create new variables with mutate_at while keeping the original ones
(1 answer)
Closed 4 years ago.
Let's assume that I have data like below:
structure(list(A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8), B = c(0, 1, 1, 0, 0, 1, 4, 9.2, 9, 0, 0, 1), C = c(2, 9, 0, 0, 0, 9, 0, 0, 0, 0, 0, 8)), .Names = c("A", "B", "C"), row.names = c(NA, -12L), class = "data.frame")
Now I would like to create dummy variables for these columns for which proportion of 0's is greater than 0.5. These dummy variables would have value 0 if there is 0 in original column, and 1 if opposite. How can I accomplish that with dplyr? I was thinking of data %>% mutate_if(~mean(. == 0) > .5, ~ifelse(. == 0, 0, 1)), but this operates in place and I need to create new variables named e.g. A01, C01 and preserve the old ones A and C.
We wrap with the funs and give a different name which will append as suffix
library(dplyr)
library(stringr)
df1 %>%
mutate_if(~mean(. == 0) > .5, funs(`01` = ifelse(. == 0, 0, 1))) %>%
rename_all(str_remove, "_")
# A B C A01 C01
#1 0 0.0 2 0 1
#2 0 1.0 9 0 1
#3 0 1.0 0 0 0
#4 0 0.0 0 0 0
#5 0 0.0 0 0 0
#6 0 1.0 9 0 1
#7 0 4.0 0 0 0
#8 0 9.2 0 0 0
#9 0 9.0 0 0 0
#10 0 0.0 0 0 0
#11 1 0.0 0 1 0
#12 8 1.0 8 1 1
In the newer version of dplyr, we can use mutate with across
df1 %>%
mutate(across(where(~ mean(. == 0) > .5),
~ as.integer(. != 0), .names = '{.col}01'))
I have a large data of car accidents and a sample of it is provided below.
accident is a binary variable of whether the accident happens or
not.
shift_number is the number of the shift, 0 means the driver is
taking a rest and not a shift.
time_diff is the amount of time at each observation.
df <- data.frame(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17
)
My question is to measure the total amount of working time since the driver starts this shift for each accident.
wanted <- data.frame
(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17,
cum_time = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 0, 0, 75)
)
Does anyone have ideas on solving this problem with R? It's better to have data.table or vectorised solution because I've got huge data to deal with.
df$cum_time = 0
accident = which(df$accident == 1)
df$cum_time[accident] <- sapply(accident, function(x) {
sum(df$time_diff[(which.max(cumsum(df$shift_number[1:x] == 0)) + 1): x])
})
df
# accident shift_number time_diff cum_time
#1 0 1 3 0
#2 0 1 4 0
#3 0 0 5 0
#4 0 0 6 0
#5 0 0 7 0
#6 0 2 8 0
#7 0 2 9 0
#8 0 2 10 0
#9 0 0 11 0
#10 0 0 12 0
#11 0 3 13 0
#12 1 3 14 27
#13 0 3 15 0
#14 0 3 16 0
#15 1 3 17 75
We first make all the values in cum_time variable as 0. We find the indices where accident has occurred. For each of those indices we find the latest 0 in shift_number and calculate the sum of values of time_diff from the latest 0 to x and assign it to its respective indices.
Use the ave function to compute the cumulative sum of time_diff by shift_number:
cumsum_by_shift <- ave(df$time_diff, df$shift_number, FUN=cumsum)
#[1] 3 7 5 11 18 8 17 27 29 41 13 27 42 58 75
Pick out elements of cumsum_by_shift where accidents occur:
cum_time <- ifelse(df$accident == 1, cumsum_by_shift, 0)
#[1] 0 0 0 0 0 0 0 0 0 0 0 27 0 0 75
Note the use of the vectorized ifelse function.
I have made a decision tree model on test data then used it to predict vales in a test dataset.
dtpredict<-predict(ct1, testdat, type="class")
The output looks like:
1 2 3 4 5 6
Class_2 Class_2 Class_6 Class_2 Class_8 Class_2
I want to write a csv to look like:
id, Class_1, Class_2, Class_3, Class_4, Class_5, Class_6, Class_7, Class_8, Class_9
1, 0, 1, 0, 0, 0, 0, 0, 0, 0
2, 0, 1, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 1, 0, 0, 0
4, 0, 1, 0, 0, 0, 0, 0, 0, 0
5, 0, 0, 0, 0, 0, 0, 0, 1, 0
6, 0, 1, 0, 0, 0, 0, 0, 0, 0
There's a package called dummies that does that well...
install.packages("dummies")
library(dummies)
x <- factor(c("Class_2", "Class_2", "Class_6", "Class_2", "Class_8", "Class_2"),
levels = paste("Class", 1:9, sep="_"))
dummy(x, drop = FALSE)
xClass_1 xClass_2 xClass_3 xClass_4 xClass_5 xClass_6 xClass_7 xClass_8 xClass_9
[1,] 0 1 0 0 0 0 0 0 0
[2,] 0 1 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 1 0 0 0
[4,] 0 1 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 1 0
[6,] 0 1 0 0 0 0 0 0 0
All that remains is to get rid of the "x" but this should not be too hard with something like this:
d <- dummy(x,drop = FALSE)
colnames(d) <- sub("x", "", colnames(d))
and then to save to disk:
write.csv(d, "somefile.csv", row.names = FALSE)
Uh, what are the 010101's - logicals? If so they don't make much sense in your example all are class 1 (doesn't correspond to your example dtpredict). If they are logicals....
# if dtpredict is a factor vector, where the values are the classes
# and the names are the boolean values:
values = as.numeric(as.character(names(dtpredict)))
classes = as.character(dtpredict)
x = data.frame(id=names(classes))
for(class in sort(unique(classes)){
x[ , class] = as.numeric(sapply(classes, FUN=function(p) p==class])
}
write.csv(x, 'blah.csv')