Getting combination Percentages in r - r

Relatively new to R and very new here in stackoverflow.
I'm trying to analyze .csv output files from a microscope.
The output will tell me whether each cell on the image is "positive" (expressed with a 1) or "negative" (with a 0)
my_data <- data.frame(cell = 1:4, marker_a = c(1, 0, 0, 0), marker_b = c(0,1,1,1), marker_c = c(0,1,1,0))
Sometimes we measure 4 markers, sometimes more.
I already wrote something that gives me a vector with the "used markers" and discards the "unused markers" (in this case it would be marker e,f, g which also show up in the .csv file).
I want to automatically get all the possible combinations that a cell can take.
A cell can be 0 for all markers, or can be positive for marker_a but negative for marker_b,marker_c,marker_d.
My end goal is to quantify all the cells that fall under each category/combination.
I would want a vector that would name each possible combination from all markers with a 0 value, to all of them wiht a 1 value.
What I have been doing so far is manually generating the combinations.
no_marker <- my_data$marker_a == 0 & my_data$marker_b == 0 & my_data$marker_c == 0
a_positive <- my_data$marker_a == 1 & my_data$marker_b == 0 & my_data$marker_c == 0...
Then I can just create a data.frame to add more samples later.
cell_phenotypes <- c("no_marker", "a_positive", "ab_positive", "abc_positive", "abcd_positive", "b_ positive", "bc_positive"...)
I just don't want to manually create the vector every time.

It sounds like you want expand.grid.
expand.grid(
marker_a = c(0, 1),
marker_b = c(0, 1),
marker_c = c(0, 1),
marker_d = c(0, 1)
)
#> marker_a marker_b marker_c marker_d
#> 1 0 0 0 0
#> 2 1 0 0 0
#> 3 0 1 0 0
#> 4 1 1 0 0
#> 5 0 0 1 0
#> 6 1 0 1 0
#> 7 0 1 1 0
#> 8 1 1 1 0
#> 9 0 0 0 1
#> 10 1 0 0 1
#> 11 0 1 0 1
#> 12 1 1 0 1
#> 13 0 0 1 1
#> 14 1 0 1 1
#> 15 0 1 1 1
#> 16 1 1 1 1
Note that 16 is the right number; you can check since 2**4 = 16.

Related

binary variable does not correspond with category numbers

I'm trying to create a new binary variable based on several categorical variables. I have tried multiple ways to do this including base or if else commands, and mutate case when from dyplyr. When I make the new variable, the number does not add up to how many were in the categories in the original variables.
data<- c("ket"(1,0,0,0,1,0)
c("weed"(0,1,1,1,0,0)
c("speed"(1,0,0,1,0,0)
c("meth"(0,0,0,1,0,0)
data<-data%>%
mutate(druguse = case_when(
(weed==1 | ket==1 | meth==1 | speed==1) ~1,
(TRUE ~0)))
the new variable should add up to how many answered one in each category, but the number in my new variable is a lot lower.
thank you!
You can avoid having to write out an explicit case_when here, by taking the sign of the sum of each row. This will be 0 if the whole row is zero, and one otherwise.
data %>% mutate(druguse = sign(rowSums(.)))
#> ket weed speed meth druguse
#> 1 1 0 1 0 1
#> 2 0 1 0 0 1
#> 3 0 1 0 0 1
#> 4 0 1 1 1 1
#> 5 1 0 0 0 1
#> 6 0 0 0 0 0
Data
data <- structure(list(ket = c(1, 0, 0, 0, 1, 0), weed = c(0, 1, 1, 1,
0, 0), speed = c(1, 0, 0, 1, 0, 0), meth = c(0, 0, 0, 1, 0, 0
)), class = "data.frame", row.names = c(NA, -6L))
data
#> ket weed speed meth
#> 1 1 0 1 0
#> 2 0 1 0 0
#> 3 0 1 0 0
#> 4 0 1 1 1
#> 5 1 0 0 0
#> 6 0 0 0 0

r: how to simultaneously change multiple column names based on the individual suffix of each column name

I have received a datasheet p autogenerated from a registry and containing 1855 columns. The autogeneration adds _vX automatically to each column name where X correspond the number of follow-ups. Unfortunately, this creates ridiculously long column names.
Eg
p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10 and p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
correspond to the 10th and 20th MRI scan on the same patient. I.e., each column that addresses clinical parameters related to the 10th follow-up ends with v1_v2_v3_v4_v5_v6_v7_v8_v9_v10.
I seek a solution, preferably in dplyr or a function, that changes the entire _v1_v2_...." suffix to fuX corresponding to the xth follow-up.
Lets say that p looks like:
a_v2 b_v2_v3 a_v2_v3_v4 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
1 0 1 1 1 0
2 1 1 0 1 0
Data
p <- structure(list(dia_maxrd_v2 = c(0, 1), hear_sev_v2_v3 = c(1, 1), reop_ind_v2_v3_v4___1 = c(1,
0), neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(1,
1), symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(0,
0)), class = "data.frame", row.names = c(NA, -2L))
EDIT
To complicate things, some column names end with "___1" indicating a specific parameter relating to that clinical parameter and should be preserved, e.g.: _v1_v2_v3_v4___1. Hence, this is still to be considered as fu4 and the ___1 part should not be omitted.
a_v2 b_v2_v3 a_v2_v3_v4___1 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
1 0 1 1 1 0
2 1 1 0 1 0
EDIT
My apologies, the solution must consider the "basic" column name specifying what parameter the column contain, e.g. post-surgical complications. It is only the _v1_v2_v3..._vX-part that should be substituted with the corresponding fuX. What comes before and after the _v1_v2_v3..._vX-part must be preserved.
Consider
dia_maxrd_v2 hear_sev_v2_v3 reop_ind_v2_v3_v4___1 neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20 symp_pre_lokal_fu20
1 0 1 1 1 0
2 1 1 0 1 0
You can use gsub with two capturing groups:
names(p) <- gsub("^(.).*?(\\d+)$", "\\1_fu\\2", names(p))
p
#> a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
#> 1 0 1 1 1 0
#> 2 1 1 0 1 0
EDIT
With new requirements stipulated by OP for including in pipe in having some different endings not in original question:
p %>% setNames(gsub("^(.).*?(\\d+_*\\d*)$", "\\1_fu\\2", names(.)))
#> a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
#> 1 0 1 1 1 0
#> 2 1 1 0 1 0
EDIT
For arbitrary starting strings, it may be easiest to gsub twice:
p %>% setNames(gsub("(\\d{1,2}_v)+", "", names(.))) %>%
setNames(gsub("_v(\\d+)", "_fu\\1", names(.)))
#> dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20
#> 1 0 1 1 1
#> 2 1 1 0 1
#> symp_pre_lokal_fu20
#> 1 0
#> 2 0

recode using ifelse clause within groups

I'm trying to set up column (called 'combined) to indicate the combined information of owner and Head within each group (Group). There is only 1 owner in each group, and 'Head' is basically the first row of each group that has the minimum id value.
This combined column should flag '1' if the ID is flagged as owner, then the rest of the id within each group will be 0 regardless of the information in 'Head'. However for groups that do not have any Owner in the IDs (i.e. all 0 in owner within the group), then this column will take the Head column information. My data looks like this and the last column (combined) is the desired outcome.
sample <- data.frame(Group = c("46005589", "46005589","46005590","46005591", "46005591","46005592","46005592","46005592", "46005593", "46005594"), ID= c("189199", "2957073", "272448", "1872092", "10374996", "1153514", "2771118","10281300", "2610301", "3564526"), Owner = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1), Head = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1), combined = c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1))
> sample
Group ID Owner Head combined
1 46005589 189199 0 1 0
2 46005589 2957073 1 0 1
3 46005590 272448 1 0 1
4 46005591 1872092 0 1 1
5 46005591 10374996 0 0 0
6 46005592 1153514 0 1 0
7 46005592 2771118 1 0 1
8 46005592 10281300 0 0 0
9 46005593 2610301 1 1 1
10 46005594 3564526 1 1 1
I've tried a few dplyr and ifelse clauses and it didn't seem to give outputs to what I wanted. How should I recode this column? Thanks.
I don't think this is the best way but you might look at visually inspecting IDs with all 0s. You could do this with rowSums and specify these IDs using %in%. Here is a possible solution:
library(dplyr)
df %>%
mutate_at(vars(ID,Group),funs(as.factor)) %>%
mutate(Combined=if_else(Owner==1,1,0),
NewCombi=ifelse(ID== "1872092",Head,Combined))
This yields: NewCombi is our target.
# Group ID Owner Head Combined NewCombi
#1 46005589 189199 0 1 0 0
#2 46005589 2957073 1 0 1 1
#3 46005590 272448 1 0 1 1
#4 46005591 1872092 0 1 0 1
#5 46005591 10374996 0 0 0 0
#6 46005592 1153514 0 1 0 0
#7 46005592 2771118 1 0 1 1
#8 46005592 10281300 0 0 0 0
#9 46005593 2610301 1 1 1 1
#10 46005594 3564526 1 1 1 1
The new combined column can be created in two steps in dplyr: first use filter(all(Owner == 0))by creating a column that only contains 'Head' information of IDs that do not contain any 'Owner', then merge this column back to the original dataframe, sum up the 1s in this column and the 1s 'Owner' column to obtain the combined info.
library(dplyr)
sample2 <- sample %>%
group_by(Group) %>%
filter(all(Owner == 0)) %>%
mutate(Head_nullowner = ifelse(Head == 1, 1, 0)) #select all rows of IDs that do not have any owners
#merge Head_nullowner with the original dataframe by both Group and ID
sample <- merge(sample, sample2[c("Group", "ID", "Head_nullowner")], by.x = c("Group", "ID"), by.y = c("Group", "ID"), all.x = T)
sample$Head_nullowner[is.na(sample$Head_nullowner)] <- 0
sample$OwnerHead_combined = sample$Owner + sample$Head_nullowner
> sample
Group ID Owner Head combined Head_nullowner OwnerHead_combined
1 46005589 189199 0 1 0 0 0
2 46005589 2957073 1 0 1 0 1
3 46005590 272448 1 0 1 0 1
4 46005591 10374996 0 0 0 0 0
5 46005591 1872092 0 1 1 1 1
6 46005592 10281300 0 0 0 0 0
7 46005592 1153514 0 1 0 0 0
8 46005592 2771118 1 0 1 0 1
9 46005593 2610301 1 1 1 0 1
10 46005594 3564526 1 1 1 0 1

R: modify a variable conditioned on data from multiple previous rows

Hi I would really appreciate some help for this, I really couldn't find the solution in previous questions.
I have a tibble in long format (rows grouped by id and arranged by time).
I want to create a variable "eleg" based on "varx". The condition would be that "eleg" = 1 if "varx" in the previous 3 rows == 0 and in the current row varx == 1, if not = 0, for each ID. If possible using dplyr.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
table <- data.frame(id, time, varx, eleg)
In my real dataset the condition is "in the previous 24 rows" and the same ID could have eleg == 1 more than one time if it suits the condition.
Thank you.
One of the approach could be
library(dplyr)
m <- 3 #number of times previous rows are looked back
df %>%
group_by(id) %>%
mutate(eleg = ifelse(rowSums(sapply(1:m, function(k) lag(varx, n = k, order_by = id, default = 1) == 0)) == m & varx == 1,
1,
0)) %>%
data.frame()
which gives
id time varx eleg
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 1 1
6 1 6 1 0
7 1 7 0 0
8 2 1 0 0
9 2 2 1 0
10 2 3 1 0
11 2 4 1 0
12 2 5 1 0
13 2 6 1 0
14 3 1 0 0
15 3 2 0 0
16 3 3 0 0
17 3 4 1 1
Sample data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3), time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,
1, 2, 3, 4), varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1)), .Names = c("id", "time", "varx"), row.names = c(NA,
-17L), class = "data.frame")
library(data.table)
df %>%
mutate(elegnew = ifelse(Reduce("+", shift(df$varx, 1:3)) == 0 & df$varx == 1, 1, 0))
id time varx eleg elegnew
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 1 1 1
6 1 6 1 0 0
7 1 7 0 0 0
8 2 1 0 0 0
9 2 2 1 0 0
10 2 3 1 0 0
11 2 4 1 0 0
12 2 5 1 0 0
13 2 6 1 0 0
14 3 1 0 0 0
15 3 2 0 0 0
16 3 3 0 0 0
17 3 4 1 1 1
Here's another approach, using dplyr and zoo:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(elegnew = as.integer(varx == 1 &
rollsum(varx == 1, k = 4, align = "right", fill = 0) == 1))
# # A tibble: 17 x 5
# # Groups: id [3]
# id time varx eleg elegnew
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1. 1. 0. 0. 0
# 2 1. 2. 0. 0. 0
# 3 1. 3. 0. 0. 0
# 4 1. 4. 0. 0. 0
# 5 1. 5. 1. 1. 1
# 6 1. 6. 1. 0. 0
# 7 1. 7. 0. 0. 0
# 8 2. 1. 0. 0. 0
# 9 2. 2. 1. 0. 0
# 10 2. 3. 1. 0. 0
# 11 2. 4. 1. 0. 0
# 12 2. 5. 1. 0. 0
# 13 2. 6. 1. 0. 0
# 14 3. 1. 0. 0. 0
# 15 3. 2. 0. 0. 0
# 16 3. 3. 0. 0. 0
# 17 3. 4. 1. 1. 1
The idea is to group by id and then check a) whether varx is 1 and b) whether the sum of varx=1 events in the previous 3 plus current row (k=4) is 1 (which means all previous 3 must be 0). I assume that varx is either 0 or 1.
You have asked for a dplyr solution, preferably.
The following is a base R one, with a function that you can adapt to "in the previous 24 rows", just pass n = 24 to the function.
fun <- function(DF, crit = "varx", new = "eleg", n = 3){
DF[[new]] <- 0
for(i in seq_len(nrow(DF))[-seq_len(n)]){
if(all(DF[[crit]][(i - n):(i - 1)] == 0) && DF[[crit]][i] == 1)
DF[[new]][i] <- 1
}
DF
}
sp <- split(table[-4], table[-4]$id)
new_df <- do.call(rbind, lapply(sp, fun))
row.names(new_df) <- NULL
identical(table, new_df)
#[1] TRUE
Note that if you are creating a new column, eleg, you would probably not need to split table[-4], just table since the 4th column wouldn't exist yet.
You could do do.call(rbind, lapply(sp, fun, n = 24)) and the rest would be the same.

Dummy code categorical / ordinal variables in the tidyverse r

Let's say I have a tibble.
library(tidyverse)
tib <- as.tibble(list(record = c(1:10),
gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
like_product = as.factor(sample(1:5, 10, replace = TRUE)))
tib
# A tibble: 10 x 3
record gender like_product
<int> <fctr> <fctr>
1 1 F 2
2 2 M 1
3 3 M 2
4 4 F 3
5 5 F 4
6 6 M 2
7 7 F 4
8 8 M 4
9 9 F 4
10 10 M 5
I would like to dummy code my data with 1's and 0's so that the data looks more/less like this.
# A tibble: 10 x 8
record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 1 0 0
2 2 0 1 0 0 0 0 0
3 3 0 1 0 1 0 0 0
4 4 0 1 1 0 0 0 0
5 5 1 0 0 0 0 0 0
6 6 0 1 0 0 0 0 0
7 7 0 1 0 0 0 0 0
8 8 0 1 0 1 0 0 0
9 9 1 0 0 0 0 0 0
10 10 1 0 0 0 0 0 1
My workflow would require that I know a range of variables to dummy code (i.e. gender:like_product), but don't want to identify EVERY variable by hand (there could be hundreds of variables). Likewise, I don't want to have to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.
I know of several ways of doing this, but none of them that fit perfectly within tidyverse. I know I could use mutate...
tib %>%
mutate(gender_M = ifelse(gender == "M", 1, 0),
gender_F = ifelse(gender == "F", 1, 0),
like_product_1 = ifelse(like_product == 1, 1, 0),
like_product_2 = ifelse(like_product == 2, 1, 0),
like_product_3 = ifelse(like_product == 3, 1, 0),
like_product_4 = ifelse(like_product == 4, 1, 0),
like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
select(-gender, -like_product)
But this would break my workflow rules of needing to specify every dummy coded output.
I've done this in the past with model.matrix, from the stats package.
model.matrix(~ gender + like_product, tib)
Easy and straightforward, but I want a solution in the tidyverse. EDIT: Reason being, I still have to specify every variable, and being able to use select helpers to specify something like gender:like_product would be much preferred.
I think the solution is in purrr
library(purrr)
dummy_code <- function(x) {
lvls <- levels(x)
sapply(lvls, function(y) as.integer(x == y)) %>% as.tibble
}
tib %>%
map_at(c("gender", "like_product"), dummy_code)
$record
[1] 1 2 3 4 5 6 7 8 9 10
$gender
# A tibble: 10 x 2
F M
<int> <int>
1 1 0
2 0 1
3 0 1
4 1 0
5 1 0
6 0 1
7 1 0
8 0 1
9 1 0
10 0 1
$like_product
# A tibble: 10 x 5
`1` `2` `3` `4` `5`
<int> <int> <int> <int> <int>
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 1 0 0 0
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
10 0 0 0 0 1
This attempt produces a list of tibbles, with the exception of the excluded variable record, and I've been unsuccessful at combining them all back into a single tibble. Additionally, I still have to specify every column, and overall it seems clunky.
Any better ideas? Thanks!!
An alternative to model.matrix is using the package recipes. This is still a work in progress and is not yet included in the tidyverse. At some point it might / will be included in the tidyverse packages.
I will leave it up to you to read up on recipes, but in the step step_dummy you can use special selectors from the tidyselect package (installed with recipes) like the selectors you can use in dplyr as starts_with(). I created a little example to show the steps.
Example code below.
But if this is handier I will leave up to you as this has already been pointed out in the comments. The function bake() uses model.matrix to create the dummies. The difference is mostly in the column names and of course in the internal checks that are being done in the underlying code of all the separate steps.
library(recipes)
library(tibble)
tib <- as.tibble(list(record = c(1:10),
gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
like_product = as.factor(sample(1:5, 10, replace = TRUE))))
dum <- tib %>%
recipe(~ .) %>%
step_dummy(gender, like_product) %>%
prep(training = tib) %>%
bake(newdata = tib)
dum
# A tibble: 10 x 6
record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1. 1. 0. 0. 0.
2 2 1. 1. 0. 0. 0.
3 3 1. 1. 0. 0. 0.
4 4 0. 0. 1. 0. 0.
5 5 0. 0. 0. 0. 0.
6 6 0. 1. 0. 0. 0.
7 7 0. 1. 0. 0. 0.
8 8 0. 0. 0. 1. 0.
9 9 0. 0. 0. 0. 1.
10 10 1. 0. 0. 0. 0.
In case you don't want to load any additional packages, you could also use pivot_wider statements like this:
tib %>%
mutate(dummy = 1) %>%
pivot_wider(names_from = gender, values_from = dummy, values_fill = 0) %>%
mutate(dummy = 1) %>%
pivot_wider(names_from = like_product, values_from = dummy, values_fill = 0, names_glue = "like_product_{like_product}")

Resources