I have this great presence-absence dataset in which I need to calculate a C score (CS) for each species pair (BABO, BW, RC, SKS, MANG) to measure species co-occurrences.
The score is CS = (ki - K)(kj - K) / (ki * kj), where ki and kj denote the numbers of occurrences of species i and j,
and K is the number of co-occurrences of both species.
I have looked at articles like Find the pair of most correlated variables and Turning co-occurrence counts into co-occurrence probabilities with cascalog but was unable to determine the most efficient way to go about it in R. I have tried creating a function but was unsuccessful.
My data:
data <- structure(list(group_id = c("2008-2-11.C3_900", "2008-2-11.C3_960",
"2008-2-11.C3_1200", "2008-2-11.C3_1230", "2008-2-11.C3_1460",
"2008-2-11.C3_1490", "2008-2-22.Mwani_0", "2008-2-22.Mwani_110",
"2008-2-22.Mwani_600", "2008-2-22.Mwani_1650", "2008-2-20.Sanje_150",
"2008-2-20.Sanje_410", "2008-2-20.Sanje_3000", "2008-5-9.C3_900",
"2008-5-13.Mwani_750", "2008-5-13.Mwani_800", "2008-5-13.Mwani_900",
"2008-5-13.Mwani_1080", "2008-5-13.Mwani_1800", "2008-5-13.Mwani_2200",
"2008-5-13.Mwani_2900"), BABO = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), BW = c(1, 1, 1, 1, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1), RC = c(0, 0, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0), SKS = c(0,
1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0),
MANG = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
group_id BABO BW RC SKS MANG
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
2008-2-11.C3_900 0 1 0 0 0
2008-2-11.C3_960 0 1 0 1 0
2008-2-11.C3_1200 0 1 1 1 0
2008-2-11.C3_1230 0 1 1 0 0
2008-2-11.C3_1460 0 0 1 0 0
2008-2-11.C3_1490 0 0 0 1 0
2008-2-22.Mwani_0 0 0 0 1 1
2008-2-22.Mwani_110 0 0 1 0 1
2008-2-22.Mwani_600 0 1 0 0 0
2008-2-22.Mwani_1650 0 0 0 1 0
2008-2-20.Sanje_150 0 1 0 1 0
2008-2-20.Sanje_410 0 0 1 0 0
2008-2-20.Sanje_3000 0 0 1 0 0
2008-5-9.C3_900 0 0 0 1 0
2008-5-13.Mwani_750 0 0 0 1 0
2008-5-13.Mwani_800 1 0 1 0 0
2008-5-13.Mwani_900 0 0 1 0 0
2008-5-13.Mwani_1080 0 1 1 0 0
2008-5-13.Mwani_1800 0 1 0 0 0
2008-5-13.Mwani_2200 0 1 0 0 0
2008-5-13.Mwani_2900 0 1 0 0 1
First define a function to compute CS for a given pair of species; then use combn() to generate all possible pairs; then pass each pair to the CS() function using purrr::map2_dbl().
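library(dplyr)
library(purrr)
# C score for the species stored in columns i and j of `data`
CS <- function(i, j, data) {
  K <- sum(data[[i]] == 1 & data[[j]] == 1)  # co-occurrences of both species
  ki <- sum(data[[i]])                       # occurrences of species i
  kj <- sum(data[[j]])                       # occurrences of species j
  ((ki - K) * (kj - K)) / (ki * kj)
}
sp_data <- data  # the data frame from the question
names(sp_data)[-1] %>%
  combn(2) %>%
  t() %>%
  as_tibble() %>%
  set_names(c("i", "j")) %>%
  mutate(CS = map2_dbl(i, j, CS, data = sp_data))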
# A tibble: 10 × 3
i j CS
<chr> <chr> <dbl>
1 BABO BW 1
2 BABO RC 0
3 BABO SKS 1
4 BABO MANG 1
5 BW RC 0.467
6 BW SKS 0.438
7 BW MANG 0.6
8 RC SKS 0.778
9 RC MANG 0.593
10 SKS MANG 0.583
(Side note: using the equation you provided, species with little co-occurrence end up with high CS scores, and vice versa. If this isn’t what you expected, perhaps your equation should be subtracted from 1?)
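As a hand check of one row of that table, here is a minimal sketch that recomputes the BW/RC score directly from the two columns in the question's data. The variable names (`cs`, `cs_flipped`) are only illustrative, and the `1 - cs` line is just the alternative raised in the side note, not something the question confirms.

```r
# BW and RC columns copied from the question's dput() output
BW <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
RC <- c(0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0)

ki <- sum(BW)                  # occurrences of BW: 10
kj <- sum(RC)                  # occurrences of RC: 9
K  <- sum(BW == 1 & RC == 1)   # sites where both occur: 3

cs <- ((ki - K) * (kj - K)) / (ki * kj)  # (7 * 6) / 90
cs_flipped <- 1 - cs                     # the "subtract from 1" variant
round(c(cs = cs, cs_flipped = cs_flipped), 3)
```

This reproduces the 0.467 shown for the BW/RC pair.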
In base R, we may use crossprod
t1 <- crossprod(as.matrix(data[-1]))
v1 <- diag(t1)
c1 <- v1[col(t1)]
r1 <- v1[row(t1)]
((c1 - t1) * (r1 - t1))/(c1 * r1)
-output
BABO BW RC SKS MANG
BABO 0 1.0000000 0.0000000 1.0000000 1.0000000
BW 1 0.0000000 0.4666667 0.4375000 0.6000000
RC 0 0.4666667 0.0000000 0.7777778 0.5925926
SKS 1 0.4375000 0.7777778 0.0000000 0.5833333
MANG 1 0.6000000 0.5925926 0.5833333 0.0000000
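The reason crossprod works here is that, for a 0/1 matrix, t(m) %*% m counts co-occurrences off the diagonal and plain occurrences on the diagonal. A small sketch on a made-up three-species matrix (the matrix `m` and its column names are assumptions for illustration only):

```r
# toy presence-absence matrix: 4 sites (rows) x 3 species (columns)
m <- cbind(A = c(1, 1, 0, 1),
           B = c(1, 0, 0, 1),
           C = c(0, 0, 1, 0))

co <- crossprod(m)  # same as t(m) %*% m
co
diag(co)            # occurrences per species (the ki / kj values)
co["A", "B"]        # sites where A and B both occur (K for that pair)
```

Because the diagonal holds ki itself, the (ki - K) term is zero there, which is why the diagonal of the C score matrix comes out as 0.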
Related
I have data like this:
df<-structure(list(a = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), b = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0)), row.names = c(NA, -19L), class = c("tbl_df",
"tbl", "data.frame"))
I would like to replace values in column A based on column B. If column B has a "1" in it, I want to replace the row in column A with a 1.
I know this can do that:
df<-df %>%mutate(a=ifelse(str_detect(b,"1"),1,0))
The problem is, this replaces everything in column A based on those rules, overwriting what was already there. I only want to replace A if it didn't already have a "1". So my expected output would be:
We just need | on the two binary columns: 'a' becomes 1 wherever either 'a' or 'b' is 1, so any 1 already in 'a' is kept.
library(dplyr)
df %>%
mutate(a = +(a|b))
-output
# A tibble: 19 × 2
a b
<int> <dbl>
1 1 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 0 0
11 0 0
12 0 0
13 0 0
14 0 0
15 0 0
16 1 1
17 0 0
18 0 0
19 0 0
Or in base R
df$a[df$b == 1] <- 1
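To see what the `+(a | b)` idiom does, here is a minimal sketch on two short made-up vectors (not the question's data). `|` treats non-zero values as TRUE, and the unary `+` converts the logical result back to integers.

```r
a <- c(1, 0, 0, 0)
b <- c(0, 0, 1, 0)

either <- a | b      # logical OR, element by element
merged <- +either    # unary plus coerces TRUE/FALSE to 1/0 (integer)
merged

# the base R assignment from the answer gives the same values
a2 <- a
a2[b == 1] <- 1
all(a2 == merged)
```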
I have data like this:
df<-structure(list(levels_incised___1 = c(0, 0, 0, 0, 0, 0), levels_incised___2 = c(1,
0, 0, 0, 0, 0), levels_incised___3 = c(1, 0, 0, 0, 0, 0), levels_incised___4 = c(1,
0, 0, 0, 0, 0), levels_incised___5 = c(1, 0, 0, 0, 0, 0), levels_incised___6 = c(1,
0, 0, 0, 0, 0), levels_incised___7 = c(1, 0, 0, 0, 0, 0), levels_incised___8 = c(1,
1, 1, 0, 0, 0), levels_incised___9 = c(1, 1, 1, 0, 0, 0), levels_incised___10 = c(1,
1, 1, 0, 0, 0), levels_incised___11 = c(0, 1, 0, 0, 0, 0), levels_incised___12 = c(0,
1, 0, 0, 0, 0), levels_incised___13 = c(0, 1, 0, 0, 0, 0), levels_incised___14 = c(0,
1, 0, 0, 0, 0), levels_incised___15 = c(0, 1, 0, 0, 0, 0), levels_incised___16 = c(0,
0, 0, 0, 0, 0), levels_incised___17 = c(0, 0, 0, 0, 0, 0), levels_incised___18 = c(0,
0, 0, 0, 0, 0), levels_incised___19 = c(0, 0, 0, 0, 0, 0), levels_incised___20 = c(0,
0, 0, 0, 0, 0), levels_incised___21 = c(0, 0, 0, 0, 0, 0), levels_incised___22 = c(0,
0, 0, 0, 1, 0), levels_incised___23 = c(0, 0, 0, 0, 1, 1), levels_incised___24 = c(0,
0, 0, 0, 1, 1), levels_incised___25 = c(0, 0, 0, 0, 1, 1), levels_incised___26 = c(0,
0, 0, 0, 1, 1), levels_incised___27 = c(0, 0, 0, 1, 1, 1), levels_incised___28 = c(0,
0, 0, 1, 1, 1), levels_incised___29 = c(0, 0, 0, 1, 1, 0), levels_incised___30 = c(0,
0, 0, 1, 1, 0), levels_incised___31 = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
That originally came from this Redcap input where each button was one of those columns:
And I need to create a column at the end (let's call it Level) with these possible inputs:
Cervical (any of the c buttons)
Thoracic (the t's)
Lumbar (the L's)
Sacral (sacral)
Thoracocervical (t's or c's)
Thoracolumbar (t's or l's)
Lumbosacral (l's and sacral)
So for instance, the patient in the first row had "1"'s in levels_incised_2 through levels_incised_10... meaning they had values in both the cervical range and the thoracic range. So that patient should get "Thoracocervical".
The patient in row 2 had 1's in 8 through 15, so they'd only get a "thoracic"
Does anyone know the most straight forward way to accomplish this?
Oh one last detail, there's 100+ other columns so it'd be nice if I could select/name these specific ones to count
A few things to resolve here:
find a way to convert levels...# to one of the C/T/... categories;
produce logic to infer based on presence of groups.
I think the first can be done by extracting the number and using findInterval to determine which of C/T/... each column belongs to. From there, we can do some simple c_across to find "any" in a group, and case_when to get your Level labels.
library(dplyr)
# helper function for renaming
func <- function(z) {
  num <- as.integer(gsub("\\D", "", z))
  grp <- c("C","T","L","S","Co","unclear")[findInterval(num, 1+c(0, 7, 19, 24, 29, 30, 31))]
  grp <- paste0(grp, ave(grp, grp, FUN = seq_along))
  # fix those that do not need numbering
  grp[grepl("^Co", grp)] <- "Co"
  grp[grepl("^unc", grp)] <- "unclear"
  grp
}
out <- df %>%
  rename_with(.cols = starts_with("levels"), .fn = func) %>%
  rowwise() %>%
  mutate(
    anyC = sum(c_across(C1:C7)) > 0,
    anyT = sum(c_across(T1:T12)) > 0,
    anyL = sum(c_across(L1:L5)) > 0,
    anyS = sum(c_across(S1:S5)) > 0
  ) %>%
  ungroup() %>%
  mutate(
    Level = case_when(
      anyC & anyT & anyL ~ "More than 2?",
      anyL & anyS ~ "Lumbosacral",
      anyT & anyL ~ "Thoracolumbar",
      anyT & anyC ~ "Thoracocervical",
      anyS ~ "Sacral",
      anyL ~ "Lumbar",
      anyT ~ "Thoracic",
      anyC ~ "Cervical",
      TRUE ~ "Nothing?"
    )
  )
out
# # A tibble: 6 x 36
# C1 C2 C3 C4 C5 C6 C7 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 L1 L2 L3 L4 L5 S1 S2 S3 S4 S5 Co unclear anyC anyT anyL anyS Level
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <chr>
# 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TRUE TRUE FALSE FALSE Thoracocervical
# 2 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 3 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FALSE TRUE FALSE FALSE Thoracic
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 FALSE FALSE FALSE TRUE Sacral
# 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 FALSE FALSE TRUE TRUE Lumbosacral
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 FALSE FALSE TRUE TRUE Lumbosacral
out$Level
# [1] "Thoracocervical" "Thoracic" "Thoracic" "Sacral" "Lumbosacral" "Lumbosacral"
If you don't want to keep the renaming, then you can combine the Level result to your original frame with cbind(df, Level = out$Level).
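The findInterval step inside the helper can be looked at in isolation. This sketch uses the same break points as the answer and feeds in the column numbers at each boundary; the names `breaks`, `labels`, and `region` are only illustrative.

```r
# lower bound of each region: C starts at 1, T at 8, L at 20, S at 25,
# Co at 30, unclear at 31 (32 closes the last interval)
breaks <- 1 + c(0, 7, 19, 24, 29, 30, 31)
labels <- c("C", "T", "L", "S", "Co", "unclear")

num <- c(1, 7, 8, 19, 20, 24, 25, 29, 30, 31)  # boundary column numbers
region <- labels[findInterval(num, breaks)]
setNames(region, num)
```

findInterval returns, for each number, the index of the interval it falls in, so indexing `labels` with that result turns a column number into its spine region.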
Using the dplyr and tidyr packages (pivot_longer() comes from tidyr):
library(dplyr)
library(tidyr)
## vertebra codes needed later on
vertebra_codes <- c(
paste0('C',1:7), paste0('T',1:12),
paste0('L',1:5), paste0('S',1:5),
'X', ## for Coccyx
'-' ## for unknown
)
df %>%
mutate(
## assuming each row is a case:
case_id = paste0('case_',row_number())
) %>%
## reshape the data from wide to long format:
pivot_longer(
cols = -case_id,
names_to = 'level_incised', values_to = 'is_incised'
) %>%
mutate(
## remove the redundant 'levels_incised__' prefix:
level_incised = gsub('.*_','',level_incised),
## assign the vertebra corresponding to 'level':
vertebra = vertebra_codes[as.integer(level_incised)],
## assign the spine region (e.g. all lumbar vertebrae start with 'L'):
spine_region = substr(vertebra,1,1)
) %>%
filter(is_incised == 1) %>% ## we're interested in incised vert. only
## remove replicates (more than one vertebra per spine region affected):
distinct(case_id, spine_region) %>%
## do the counts per case:
group_by(case_id) %>%
## string together the affected regions per case:
summarise(incised_regions = paste(spine_region, collapse = ','))
result:
# A tibble: 6 x 2
case_id incised_regions
<chr> <chr>
1 case_1 C,T
2 case_2 T
3 case_3 T
4 case_4 S,X
5 case_5 L,S,X
6 case_6 L,S
(Note that original `df` remains unchanged throughout the processing pipeline. However you can break up the pipeline by removing the `%>%` operator and inspect the intermediary steps, or assign them to temporary objects.)
extra / for fun: example code to ggplot the spine with vertebra status (incised or not) per patient.
This question already has answers here:
For each row return the column name of the largest value
(10 answers)
Closed 2 years ago.
Hello, I am trying to create a new variable in my data set that combines the "education" dummy variables into a single variable labelled with their respective character strings, so I can use edu as a factor in a regression model.
I am not certain how to create a new variable "edu" with "edu4" in the first and second rows, and so on...
Help is much appreciated!
Since you did not provide the dataset via dput(), I built a small example myself.
dput(df)
structure(list(id = 1:10, edu1 = c(1, 0, 0, 0, 0, 0, 0, 0, 1,
0), edu2 = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0), edu3 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0), edu4 = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
edu5 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA,
-10L))
Solution
df$edu = factor(apply(df[,paste0("edu", 1:5)], 1, which.max))
Result
> df
id edu1 edu2 edu3 edu4 edu5 edu
1 1 1 0 0 0 0 1
2 2 0 0 0 1 0 4
3 3 0 0 0 1 0 4
4 4 0 0 0 0 1 5
5 5 0 0 0 1 0 4
6 6 0 1 0 0 0 2
7 7 0 0 0 0 1 5
8 8 0 1 0 0 0 2
9 9 1 0 0 0 0 1
10 10 0 0 0 0 1 5
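The result above stores the column position (1 to 5). If the labels "edu1" ... "edu5" asked for in the question are wanted instead, one possible variant, sketched here on the same example data, is to index the column names with that position. The name `edu_cols` is illustrative, and note that which.max returns 1 for a row of all zeros, so such rows would be labelled "edu1".

```r
df <- data.frame(
  id   = 1:10,
  edu1 = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0),
  edu2 = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
  edu3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  edu4 = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  edu5 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1)
)

edu_cols <- paste0("edu", 1:5)
idx <- apply(df[, edu_cols], 1, which.max)   # position of the 1 in each row
df$edu <- factor(edu_cols[idx], levels = edu_cols)
df$edu
```

Passing `levels = edu_cols` keeps "edu3" as a level even though no row in this example has it.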
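Try this: df is your data frame, and your edu variables are in columns 7 to 12, but we start from column 8. If all of the edu variables in columns 8 to 12 are 0, "edu1" will be generated.
# %*% needs a numeric matrix, so convert the data frame columns first
edu_cols <- as.matrix(df[ ,8:12])
factor_variable <- factor(drop(edu_cols %*% seq_len(ncol(edu_cols))) + 1,
levels = seq_len(ncol(edu_cols) + 1),
labels = c("edu1", colnames(edu_cols)))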
Let me know if this worked.
My data comes from a multiple-choice question where respondents could choose more than one selection (the five selections are different roles they held, such as Role1 is a participant on the IT Committee, or Role2 is a participant on the Budget Committee, etc.). I converted the roles into binary variables where a "1" indicates the respondent selected that role and a "0" indicates they did not select it.
Here is sample data:
structure(list(Role1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 1), Role2 = c(0, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1), Role3 = c(1, 0, 0, 0, 0, 1,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1), Role4 = c(0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0), Role5 = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Using the bindata package, its condprob function calculates the probability that a respondent held any of the roles given that they held another role.
library(bindata)
# Returns a matrix containing the conditional probabilities, and converts the matrix to a data frame
condlTable.df <- as.data.frame(condprob(SOdata))
My programming question: How can R take any random pair of roles and calculate the conditional probability of any of the other roles? For example, if a respondent selected Role1 and Role2, a pair of roles, what is the probability that they also selected Role3, or Role4, or Role5? The ideal output would be like the output of condprob but for pairs of selections.
Thank you for your help.
Would the prop.table function give you what you need?
For example:
prop.table(condlTable.df)
Called without a margin argument, this gives each cell as a proportion of the grand total of the table, not by row:
Role1 Role2 Role3 Role4 Role5
Role1 0.07097829 0.05977119 0.02241420 0.01867850 0.007471399
Role2 0.07097829 0.07097829 0.01774457 0.01330843 0.008872286
Role3 0.06083853 0.04055902 0.07097829 0.02027951 0.020279510
Role4 0.07097829 0.04258697 0.02839131 0.07097829 0.000000000
Role5 0.04731886 0.04731886 0.04731886 0.00000000 0.070978286
adding CondlTable.df per my comment/question below.
Role1 Role2 Role3 Role4 Role5
Role1 1.0000000 0.8421053 0.3157895 0.2631579 0.1052632
Role2 1.0000000 1.0000000 0.2500000 0.1875000 0.1250000
Role3 0.8571429 0.5714286 1.0000000 0.2857143 0.2857143
Role4 1.0000000 0.6000000 0.4000000 1.0000000 0.0000000
Role5 0.6666667 0.6666667 0.6666667 0.0000000 1.0000000
Here is your original data:
Role1 Role2 Role3 Role4 Role5
1 1 0 1 0 0
2 1 1 0 1 0
3 1 1 0 0 0
4 1 1 0 1 0
5 1 1 0 0 0
6 1 0 1 1 0
7 1 1 0 0 0
8 1 1 0 0 1
9 1 1 0 0 0
10 1 1 1 0 0
11 1 1 0 0 0
12 1 0 0 1 0
13 1 1 1 0 0
14 1 1 0 0 0
15 1 1 0 0 0
16 1 1 0 0 0
17 1 1 1 1 0
18 1 1 0 0 0
19 0 0 1 0 1
20 1 1 1 0 1
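For the pairwise version asked about in the question, one possible approach, sketched here in base R on the sample data, is to keep only the respondents who selected both roles of a pair and take the column means of the remaining roles. The helper name `cond_prob_given_pair` and the objects `roles`, `pairs`, and `res` are hypothetical, not part of bindata.

```r
roles <- data.frame(
  Role1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1),
  Role2 = c(0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1),
  Role3 = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1),
  Role4 = c(0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0),
  Role5 = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
)

# hypothetical helper: P(other role = 1 | both roles in `pair` = 1)
cond_prob_given_pair <- function(data, pair) {
  both <- data[[pair[1]]] == 1 & data[[pair[2]]] == 1
  others <- setdiff(names(data), pair)
  # no respondent holds both roles: the probability is undefined
  if (!any(both)) return(setNames(rep(NA_real_, length(others)), others))
  colMeans(data[both, others, drop = FALSE])
}

pairs <- combn(names(roles), 2, simplify = FALSE)
res <- t(sapply(pairs, function(p) {
  out <- setNames(rep(NA_real_, ncol(roles)), names(roles))
  cp <- cond_prob_given_pair(roles, p)
  out[names(cp)] <- cp   # the two conditioning roles stay NA
  out
}))
rownames(res) <- sapply(pairs, paste, collapse = " & ")
round(res, 3)
```

Each row is a pair of roles and each column is the probability of the remaining role given that pair. Pairs that no respondent selected together (Role4 and Role5 in this sample) yield NA rather than a probability.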
I have a large data of car accidents and a sample of it is provided below.
accident is a binary variable of whether the accident happens or
not.
shift_number is the number of the shift, 0 means the driver is
taking a rest and not a shift.
time_diff is the amount of time at each observation.
df <- data.frame(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17
)
For each accident, I want to measure the total amount of working time since the driver started the current shift.
wanted <- data.frame(
accident = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1),
shift_number = c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3),
time_diff = 3:17,
cum_time = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 0, 0, 75)
)
Does anyone have ideas on solving this problem with R? It's better to have data.table or vectorised solution because I've got huge data to deal with.
df$cum_time = 0
accident = which(df$accident == 1)
df$cum_time[accident] <- sapply(accident, function(x) {
sum(df$time_diff[(which.max(cumsum(df$shift_number[1:x] == 0)) + 1): x])
})
df
# accident shift_number time_diff cum_time
#1 0 1 3 0
#2 0 1 4 0
#3 0 0 5 0
#4 0 0 6 0
#5 0 0 7 0
#6 0 2 8 0
#7 0 2 9 0
#8 0 2 10 0
#9 0 0 11 0
#10 0 0 12 0
#11 0 3 13 0
#12 1 3 14 27
#13 0 3 15 0
#14 0 3 16 0
#15 1 3 17 75
We first make all the values in cum_time variable as 0. We find the indices where accident has occurred. For each of those indices we find the latest 0 in shift_number and calculate the sum of values of time_diff from the latest 0 to x and assign it to its respective indices.
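The "latest 0" lookup relies on which.max returning the first position of the maximum. A minimal sketch for the first accident (row 12), using the first twelve rows of the question's data:

```r
shift_number <- c(1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3)  # rows 1..12 of df
time_diff    <- 3:14

# cumsum() of the "is rest" flag only increases at a rest row, so the first
# position where it reaches its maximum is the last rest row before row 12
rest_count <- cumsum(shift_number == 0)
last_rest  <- which.max(rest_count)
last_rest

# working time from the row after the last rest up to the accident row
shift_time <- sum(time_diff[(last_rest + 1):12])
shift_time
```

Here the last rest row is row 10, so the sum covers rows 11 and 12 (13 + 14), matching the 27 in the output above.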
Use the ave function to compute the cumulative sum of time_diff by shift_number:
cumsum_by_shift <- ave(df$time_diff, df$shift_number, FUN=cumsum)
#[1] 3 7 5 11 18 8 17 27 29 41 13 27 42 58 75
Pick out elements of cumsum_by_shift where accidents occur:
cum_time <- ifelse(df$accident == 1, cumsum_by_shift, 0)
#[1] 0 0 0 0 0 0 0 0 0 0 0 27 0 0 75
Note the use of the vectorized ifelse function.