Create new columns with mutate_if [duplicate] - r

Let's assume that I have data like below:
structure(list(A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8), B = c(0, 1, 1, 0, 0, 1, 4, 9.2, 9, 0, 0, 1), C = c(2, 9, 0, 0, 0, 9, 0, 0, 0, 0, 0, 8)), .Names = c("A", "B", "C"), row.names = c(NA, -12L), class = "data.frame")
Now I would like to create dummy variables for those columns in which the proportion of 0s is greater than 0.5. Each dummy variable should be 0 where the original column is 0, and 1 otherwise. How can I accomplish that with dplyr? I was thinking of data %>% mutate_if(~mean(. == 0) > .5, ~ifelse(. == 0, 0, 1)), but this operates in place, and I need to create new variables named e.g. A01, C01 while preserving the old ones, A and C.

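We wrap the function in funs() and give it a name, which is appended to the column name as a suffix (A_01, C_01); rename_all() then removes the underscore. Here df1 is the data frame from the question. (funs() is deprecated as of dplyr 0.8.0.)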
library(dplyr)
library(stringr)
df1 %>%
mutate_if(~mean(. == 0) > .5, funs(`01` = ifelse(. == 0, 0, 1))) %>%
rename_all(str_remove, "_")
# A B C A01 C01
#1 0 0.0 2 0 1
#2 0 1.0 9 0 1
#3 0 1.0 0 0 0
#4 0 0.0 0 0 0
#5 0 0.0 0 0 0
#6 0 1.0 9 0 1
#7 0 4.0 0 0 0
#8 0 9.2 0 0 0
#9 0 9.0 0 0 0
#10 0 0.0 0 0 0
#11 1 0.0 0 1 0
#12 8 1.0 8 1 1
In dplyr 1.0.0 and later, we can use mutate with across:
df1 %>%
mutate(across(where(~ mean(. == 0) > .5),
~ as.integer(. != 0), .names = '{.col}01'))
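For readers without dplyr, the same select-then-suffix logic can be sketched in base R. This is only an illustration, not the answerer's code: the names mostly_zero and result are made up here, and df1 is rebuilt from the question's data.

```r
# Data from the question
df1 <- data.frame(
  A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8),
  B = c(0, 1, 1, 0, 0, 1, 4, 9.2, 9, 0, 0, 1),
  C = c(2, 9, 0, 0, 0, 9, 0, 0, 0, 0, 0, 8)
)

# Columns whose share of zeros exceeds 0.5 (hypothetical helper name)
mostly_zero <- names(df1)[vapply(df1, function(x) mean(x == 0) > 0.5, logical(1))]

# Add one 0/1 indicator per selected column, keeping the originals
result <- df1
result[paste0(mostly_zero, "01")] <- lapply(df1[mostly_zero],
                                            function(x) as.integer(x != 0))
names(result)
```

Only A and C qualify, because B is zero in 5 of its 12 rows.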


Co-occurrence calculations from a presence-absence matrix

I have this great presence-absence dataset in which I need to calculate a C score (CS) for each species pair (BABO, BW, RC, SKS, MANG) to measure species co-occurrences:
CS = (ki - K)(kj - K) / (ki * kj)
where ki and kj denote the numbers of occurrences of species i and j, and K is the number of co-occurrences of both species.
I have looked at articles like Find the pair of most correlated variables and Turning co-occurrence counts into co-occurrence probabilities with cascalog but was unable to determine the most efficient way to go about it in R. I have tried creating a function but was unsuccessful.
My data:
data <- structure(list(group_id = c("2008-2-11.C3_900", "2008-2-11.C3_960",
"2008-2-11.C3_1200", "2008-2-11.C3_1230", "2008-2-11.C3_1460",
"2008-2-11.C3_1490", "2008-2-22.Mwani_0", "2008-2-22.Mwani_110",
"2008-2-22.Mwani_600", "2008-2-22.Mwani_1650", "2008-2-20.Sanje_150",
"2008-2-20.Sanje_410", "2008-2-20.Sanje_3000", "2008-5-9.C3_900",
"2008-5-13.Mwani_750", "2008-5-13.Mwani_800", "2008-5-13.Mwani_900",
"2008-5-13.Mwani_1080", "2008-5-13.Mwani_1800", "2008-5-13.Mwani_2200",
"2008-5-13.Mwani_2900"), BABO = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), BW = c(1, 1, 1, 1, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1), RC = c(0, 0, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0), SKS = c(0,
1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0),
MANG = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
group_id BABO BW RC SKS MANG
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
2008-2-11.C3_900 0 1 0 0 0
2008-2-11.C3_960 0 1 0 1 0
2008-2-11.C3_1200 0 1 1 1 0
2008-2-11.C3_1230 0 1 1 0 0
2008-2-11.C3_1460 0 0 1 0 0
2008-2-11.C3_1490 0 0 0 1 0
2008-2-22.Mwani_0 0 0 0 1 1
2008-2-22.Mwani_110 0 0 1 0 1
2008-2-22.Mwani_600 0 1 0 0 0
2008-2-22.Mwani_1650 0 0 0 1 0
2008-2-20.Sanje_150 0 1 0 1 0
2008-2-20.Sanje_410 0 0 1 0 0
2008-2-20.Sanje_3000 0 0 1 0 0
2008-5-9.C3_900 0 0 0 1 0
2008-5-13.Mwani_750 0 0 0 1 0
2008-5-13.Mwani_800 1 0 1 0 0
2008-5-13.Mwani_900 0 0 1 0 0
2008-5-13.Mwani_1080 0 1 1 0 0
2008-5-13.Mwani_1800 0 1 0 0 0
2008-5-13.Mwani_2200 0 1 0 0 0
2008-5-13.Mwani_2900 0 1 0 0 1
First define a function to compute CS for a given pair of species; then use combn() to generate all possible pairs; then pass each pair to the CS() function using purrr::map2_dbl().
library(dplyr)
library(purrr)
CS <- function(i, j, data) {
K <- sum(data[[i]] == 1 & data[[j]] == 1)
ki <- sum(data[[i]])
kj <- sum(data[[j]])
((ki - K) * (kj - K)) / (ki * kj)
}
sp_data <- data  # the tibble from the question, which is named `data` there
names(sp_data)[-1] %>%
combn(2) %>%
t() %>%
as_tibble() %>%
set_names(c("i", "j")) %>%
mutate(CS = map2_dbl(i, j, CS, data = sp_data))
# A tibble: 10 × 3
i j CS
<chr> <chr> <dbl>
1 BABO BW 1
2 BABO RC 0
3 BABO SKS 1
4 BABO MANG 1
5 BW RC 0.467
6 BW SKS 0.438
7 BW MANG 0.6
8 RC SKS 0.778
9 RC MANG 0.593
10 SKS MANG 0.583
(Side note: using the equation you provided, species with little co-occurrence end up with high CS scores, and vice versa. If this isn’t what you expected, perhaps your equation should be subtracted from 1?)
In base R, we may use crossprod
t1 <- crossprod(as.matrix(data[-1]))
v1 <- diag(t1)
c1 <- v1[col(t1)]
r1 <- v1[row(t1)]
((c1 - t1) * (r1 - t1))/(c1 * r1)
Output:
BABO BW RC SKS MANG
BABO 0 1.0000000 0.0000000 1.0000000 1.0000000
BW 1 0.0000000 0.4666667 0.4375000 0.6000000
RC 0 0.4666667 0.0000000 0.7777778 0.5925926
SKS 1 0.4375000 0.7777778 0.0000000 0.5833333
MANG 1 0.6000000 0.5925926 0.5833333 0.0000000
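To see why crossprod helps here, a toy example may be useful. This is a sketch on a made-up 4-site, 3-species matrix (m, co, k and cs are illustrative names, not from the answer): the diagonal of crossprod(m) holds each species' occurrence count and the off-diagonal entries hold the pairwise co-occurrence counts, which is everything the CS formula needs.

```r
# Made-up presence-absence matrix: 4 sites, 3 species
m <- cbind(X = c(1, 1, 0, 1),
           Y = c(1, 0, 0, 1),
           Z = c(0, 0, 1, 0))

co <- crossprod(m)        # same as t(m) %*% m
k  <- unname(diag(co))    # occurrences per species: 3, 2, 1

# CS for every pair at once, following the formula used above
cs <- (k[row(co)] - co) * (k[col(co)] - co) / (k[row(co)] * k[col(co)])
cs
```

X and Y share two sites and Y never occurs without X, so their score is 0; Z never co-occurs with the others, so its scores are 1.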

Replace values in a column unless there's already a "1" there

I have data like this:
df<-structure(list(a = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), b = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0)), row.names = c(NA, -19L), class = c("tbl_df",
"tbl", "data.frame"))
I would like to replace values in column a based on column b. If column b has a 1 in a given row, I want column a to be set to 1 in that row.
I know this can do that:
df<-df %>%mutate(a=ifelse(str_detect(b,"1"),1,0))
The problem is, this replaces everything in column A based on those rules, overwriting what was already there. I only want to replace A if it didn't already have a "1". So my expected output would be:
Since both columns are binary, a logical OR (|) is enough: 'a' becomes 1 wherever either 'a' or 'b' is 1, and the unary + converts the logical result back to 0/1.
library(dplyr)
df %>%
mutate(a = +(a|b))
Output:
# A tibble: 19 × 2
a b
<int> <dbl>
1 1 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 0 0
11 0 0
12 0 0
13 0 0
14 0 0
15 0 0
16 1 1
17 0 0
18 0 0
19 0 0
Or in base R
df$a[df$b == 1] <- 1
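A small stand-alone check of the OR trick, using made-up vectors rather than the question's data. pmax() is shown as an equivalent alternative for 0/1 columns; it is not part of the answers above.

```r
a <- c(1, 0, 0, 1)
b <- c(0, 0, 1, 1)

via_or   <- +(a | b)    # logical OR, then unary + turns TRUE/FALSE into 1/0
via_pmax <- pmax(a, b)  # element-wise maximum gives the same values

via_or
via_pmax
```

+(a | b) returns an integer vector, which is why column a prints as <int> in the tibble output, whereas pmax(a, b) stays double.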

Count occurrences in specific column ranges and return factor variable, R

I have data like this:
df<-structure(list(levels_incised___1 = c(0, 0, 0, 0, 0, 0), levels_incised___2 = c(1,
0, 0, 0, 0, 0), levels_incised___3 = c(1, 0, 0, 0, 0, 0), levels_incised___4 = c(1,
0, 0, 0, 0, 0), levels_incised___5 = c(1, 0, 0, 0, 0, 0), levels_incised___6 = c(1,
0, 0, 0, 0, 0), levels_incised___7 = c(1, 0, 0, 0, 0, 0), levels_incised___8 = c(1,
1, 1, 0, 0, 0), levels_incised___9 = c(1, 1, 1, 0, 0, 0), levels_incised___10 = c(1,
1, 1, 0, 0, 0), levels_incised___11 = c(0, 1, 0, 0, 0, 0), levels_incised___12 = c(0,
1, 0, 0, 0, 0), levels_incised___13 = c(0, 1, 0, 0, 0, 0), levels_incised___14 = c(0,
1, 0, 0, 0, 0), levels_incised___15 = c(0, 1, 0, 0, 0, 0), levels_incised___16 = c(0,
0, 0, 0, 0, 0), levels_incised___17 = c(0, 0, 0, 0, 0, 0), levels_incised___18 = c(0,
0, 0, 0, 0, 0), levels_incised___19 = c(0, 0, 0, 0, 0, 0), levels_incised___20 = c(0,
0, 0, 0, 0, 0), levels_incised___21 = c(0, 0, 0, 0, 0, 0), levels_incised___22 = c(0,
0, 0, 0, 1, 0), levels_incised___23 = c(0, 0, 0, 0, 1, 1), levels_incised___24 = c(0,
0, 0, 0, 1, 1), levels_incised___25 = c(0, 0, 0, 0, 1, 1), levels_incised___26 = c(0,
0, 0, 0, 1, 1), levels_incised___27 = c(0, 0, 0, 1, 1, 1), levels_incised___28 = c(0,
0, 0, 1, 1, 1), levels_incised___29 = c(0, 0, 0, 1, 1, 0), levels_incised___30 = c(0,
0, 0, 1, 1, 0), levels_incised___31 = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
That originally came from this Redcap input where each button was one of those columns:
And I need to create a column at the end (let's call it Level) with these possible inputs:
Cervical (any of the c buttons)
Thoracic (the t's)
Lumbar (the L's)
Sacral (sacral)
Thoracocervical (t's or c's)
Thoracolumbar (t's or l's)
Lumbosacral (l's and sacral)
So for instance, the patient in the first row had "1"'s in levels_incised_2 through levels_incised_10... meaning they had values in both the cervical range and the thoracic range. So that patient should get "Thoracocervical".
The patient in row 2 had 1's in 8 through 15, so they'd only get "Thoracic".
Does anyone know the most straightforward way to accomplish this?
Oh, one last detail: there are 100+ other columns, so it'd be nice if I could select/name these specific ones to count.
A few things to resolve here:
find a way to convert levels...# to one of the C/T/... categories;
produce logic to infer based on presence of groups.
I think the first can be done by extracting the number and using findInterval to determine which of C/T/... each column belongs to. From there, we can do some simple c_across to find "any" in a group, and case_when to get your Level labels.
library(dplyr)
# helper function for renaming
func <- function(z) {
num <- as.integer(gsub("\\D", "", z))
grp <- c("C","T","L","S","Co","unclear")[findInterval(num, 1+c(0, 7, 19, 24, 29, 30, 31))]
grp <- paste0(grp, ave(grp, grp, FUN = seq_along))
# fix those that do not need numbering
grp[grepl("^Co", grp)] <- "Co"
grp[grepl("^unc", grp)] <- "unclear"
grp
}
out <- df %>%
rename_with(.cols = starts_with("levels"), .fn = func) %>%
rowwise() %>%
mutate(
anyC = sum(c_across(C1:C7)) > 0,
anyT = sum(c_across(T1:T12)) > 0,
anyL = sum(c_across(L1:L5)) > 0,
anyS = sum(c_across(S1:S5)) > 0
) %>%
ungroup() %>%
mutate(
Level = case_when(
anyC & anyT & anyL ~ "More than 2?",
anyL & anyS ~ "Lumbosacral",
anyT & anyL ~ "Thoracolumbar",
anyT & anyC ~ "Thoracocervical",
anyS ~ "Sacral",
anyL ~ "Lumbar",
anyT ~ "Thoracic",
anyC ~ "Cervical",
TRUE ~ "Nothing?"
)
)
out
# # A tibble: 6 x 36
# C1 C2 C3 C4 C5 C6 C7 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 L1 L2 L3 L4 L5 S1 S2 S3 S4 S5 Co unclear anyC anyT anyL anyS Level
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <chr>
# 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TRUE TRUE FALSE FALSE Thoracocervical
# 2 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 3 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 FALSE FALSE FALSE TRUE Sacral
# 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 FALSE FALSE TRUE TRUE Lumbosacral
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 FALSE FALSE TRUE TRUE Lumbosacral
out$Level
# [1] "Thoracocervical" "Thoracic" "Thoracic" "Sacral" "Lumbosacral" "Lumbosacral"
If you don't want to keep the renaming, then you can combine the Level result to your original frame with cbind(df, Level = out$Level).
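The findInterval step in the first answer can be checked in isolation. The sketch below reuses the same break points as func() and feeds in the first and last column number of each region; breaks, regions and region_of are illustrative names.

```r
# Same break points as in func(): 1, 8, 20, 25, 30, 31, 32
breaks  <- 1 + c(0, 7, 19, 24, 29, 30, 31)
regions <- c("C", "T", "L", "S", "Co", "unclear")

# First and last column number of each region
num <- c(1, 7, 8, 19, 20, 24, 25, 29, 30, 31)

# findInterval() returns the index of the interval each number falls into
region_of <- regions[findInterval(num, breaks)]
region_of
```

Columns 1 to 7 map to C, 8 to 19 to T, 20 to 24 to L, 25 to 29 to S, 30 to Co and 31 to unclear.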
Using the dplyr and tidyr packages (pivot_longer comes from tidyr):
library(dplyr)
library(tidyr)
## vertebra codes needed later on
vertebra_codes <- c(
paste0('C',1:7), paste0('T',1:12),
paste0('L',1:5), paste0('S',1:5),
'X', ## for Coccyx
'-' ## for unknown
)
df %>%
mutate(
## assuming each row is a case:
case_id = paste0('case_',row_number())
) %>%
## reshape the data from wide to long format:
pivot_longer(
cols = -case_id,
names_to = 'level_incised', values_to = 'is_incised'
) %>%
mutate(
## remove the redundant 'levels_incised__' prefix:
level_incised = gsub('.*_','',level_incised),
## assign the vertebra corresponding to 'level':
vertebra = vertebra_codes[as.integer(level_incised)],
## assign the spine region (e.g.: all lumbar vert. start with 'L')
spine_region = substr(vertebra,1,1)
) %>%
filter(is_incised == 1) %>% ## we're interested in incised vert. only
## remove replicates (more than one vertebra per spine region affected):
distinct(case_id, spine_region) %>%
## do the counts per case:
group_by(case_id) %>%
## string together the affected regions per case:
summarise(incised_regions = paste(spine_region, collapse = ','))
result:
# A tibble: 6 x 2
case_id incised_regions
<chr> <chr>
1 case_1 C,T
2 case_2 T
3 case_3 T
4 case_4 S,X
5 case_5 L,S,X
6 case_6 L,S
(Note that original `df` remains unchanged throughout the processing pipeline. However you can break up the pipeline by removing the `%>%` operator and inspect the intermediary steps, or assign them to temporary objects.)
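The region strings above still need to be turned into the labels the question asks for. One possible way is a named lookup vector, sketched below. The mapping and the decision to ignore the coccyx ('X') and unknown ('-') codes are assumptions made here to match the first answer's output; region_labels, cleaned and Level are illustrative names.

```r
# Hypothetical lookup from region string to the labels listed in the question
region_labels <- c(
  "C"   = "Cervical",
  "T"   = "Thoracic",
  "L"   = "Lumbar",
  "S"   = "Sacral",
  "C,T" = "Thoracocervical",
  "T,L" = "Thoracolumbar",
  "L,S" = "Lumbosacral"
)

# The incised_regions column from the result above
incised_regions <- c("C,T", "T", "T", "S,X", "L,S,X", "L,S")

# Assumption: drop coccyx ('X') and unknown ('-') before the lookup
cleaned <- gsub(",?[X-]", "", incised_regions)
Level   <- unname(region_labels[cleaned])
Level
```

Any combination missing from the lookup (for example three regions at once) comes back as NA, so such cases would need their own entries.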
extra / for fun: example code to ggplot the spine with vertebra status (incised or not) per patient.

Dummy variables to factor [duplicate]

Hello, I am trying to create a new variable in my data set that combines the "education" dummies into a single variable holding the name of the matching dummy, so I can use edu as a factor in a regression model.
I am not certain how to create a new variable "edu" with "edu4" in the first and second rows, and so on.
Help is much appreciated!
As you did not provide the dataset via dput(), I built a small example myself.
dput(df)
structure(list(id = 1:10, edu1 = c(1, 0, 0, 0, 0, 0, 0, 0, 1,
0), edu2 = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0), edu3 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), edu4 = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
edu5 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA,
-10L))
Solution
df$edu = factor(apply(df[,paste0("edu", 1:5)], 1, which.max))
Result
> df
id edu1 edu2 edu3 edu4 edu5 edu
1 1 1 0 0 0 0 1
2 2 0 0 0 1 0 4
3 3 0 0 0 1 0 4
4 4 0 0 0 0 1 5
5 5 0 0 0 1 0 4
6 6 0 1 0 0 0 2
7 7 0 0 0 0 1 5
8 8 0 1 0 0 0 2
9 9 1 0 0 0 0 1
10 10 0 0 0 0 1 5
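If the factor should carry the column names ("edu4") rather than the position (4), base R's max.col() is one alternative. This is a sketch on a made-up three-row frame, not the answerer's code; edu, idx and edu_factor are illustrative names.

```r
# Made-up dummies: exactly one 1 per row
edu <- data.frame(edu1 = c(1, 0, 0),
                  edu2 = c(0, 0, 1),
                  edu3 = c(0, 1, 0))

# Column index of the row-wise maximum; ties resolved towards the first column
idx <- max.col(as.matrix(edu), ties.method = "first")

# Turn the index into a factor labelled with the column names
edu_factor <- factor(names(edu)[idx], levels = names(edu))
edu_factor
```

As with which.max, a row containing only zeros would be assigned the first column, so such rows should be checked separately.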
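Try this, where df is your data frame and the edu variables are in columns 7 to 12. We start from column 8, so a row in which columns 8 to 12 are all 0 is assigned edu1. Note that %*% needs a matrix, hence as.matrix(), and that levels is set explicitly so the labels still line up when some category never occurs.
edu_mat <- as.matrix(df[ ,8:12])
factor_variable <- factor(drop(edu_mat %*% seq_len(ncol(edu_mat))) + 1,
levels = seq_len(ncol(edu_mat) + 1),
labels = c("edu1", colnames(edu_mat)))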
Let me know if this worked.

cumsum according to certain restricts in r

I have a large data of car accidents and a sample of it is provided below.
accident is a binary variable indicating whether an accident happened.
shift_number is the number of the shift; 0 means the driver is resting rather than on a shift.
time_diff is the amount of time at each observation.
df <- data.frame(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17
)
I want to measure, for each accident, the total working time since the driver started the current shift.
wanted <- data.frame(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17,
cum_time = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 0, 0, 75)
)
Does anyone have ideas on solving this problem with R? A data.table or vectorised solution would be preferable because I have a huge amount of data to deal with.
df$cum_time = 0
accident = which(df$accident == 1)
df$cum_time[accident] <- sapply(accident, function(x) {
sum(df$time_diff[(which.max(cumsum(df$shift_number[1:x] == 0)) + 1): x])
})
df
# accident shift_number time_diff cum_time
#1 0 1 3 0
#2 0 1 4 0
#3 0 0 5 0
#4 0 0 6 0
#5 0 0 7 0
#6 0 2 8 0
#7 0 2 9 0
#8 0 2 10 0
#9 0 0 11 0
#10 0 0 12 0
#11 0 3 13 0
#12 1 3 14 27
#13 0 3 15 0
#14 0 3 16 0
#15 1 3 17 75
We first set every value of cum_time to 0, then find the indices where an accident occurred. For each such index x, we locate the most recent 0 in shift_number (the last rest before x) and sum time_diff from the row after that 0 up to x, assigning the result to row x.
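The which.max(cumsum(...)) idiom is compact, so here is a step-by-step sketch for the first accident (row 12 of the question's data). The names is_rest, running and last_rest are illustrative.

```r
# First 12 rows of the question's data, up to the first accident
shift_number <- c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3)
time_diff    <- 3:14

is_rest <- shift_number == 0
running <- cumsum(is_rest)       # running count of rest rows seen so far

# which.max() returns the FIRST position of the maximum, i.e. the row
# where the count last increased: the most recent rest row
last_rest <- which.max(running)

# Working time from the row after the last rest up to the accident
work_time <- sum(time_diff[(last_rest + 1):length(time_diff)])
work_time
```

The last rest is row 10, so the sum covers rows 11 and 12 and gives 27. If no rest row precedes the accident, which.max returns 1 and the first row is skipped, so that case would need separate handling.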
Use the ave function to compute the cumulative sum of time_diff by shift_number:
cumsum_by_shift <- ave(df$time_diff, df$shift_number, FUN=cumsum)
#[1] 3 7 5 11 18 8 17 27 29 41 13 27 42 58 75
Pick out elements of cumsum_by_shift where accidents occur:
cum_time <- ifelse(df$accident == 1, cumsum_by_shift, 0)
#[1] 0 0 0 0 0 0 0 0 0 0 0 27 0 0 75
Note the use of the vectorized ifelse function.
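Grouping by shift_number works here because each shift has its own number. If shift numbers could repeat in the real data (an assumption, not something the question states), grouping by a run identifier instead would keep separate shifts apart. A sketch, with run_id and cumsum_by_run as illustrative names:

```r
df <- data.frame(
  accident     = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
  shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
  time_diff    = 3:17
)

# A new run starts whenever shift_number changes from one row to the next
run_id <- cumsum(c(TRUE, diff(df$shift_number) != 0))

# Cumulative time within each run, kept only at accident rows
cumsum_by_run <- ave(df$time_diff, run_id, FUN = cumsum)
cum_time <- ifelse(df$accident == 1, cumsum_by_run, 0)
cum_time
```

On the sample data this gives the same cum_time as the wanted column, with 27 in row 12 and 75 in row 15.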
