Assign value to new column based on values in 2 other columns - r

Here is some example code:
Group <- c("A", "A", "A", "A", "A", "B", "B", "B","B", "B")
Actor <- c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3)
df <- data.frame(Group,Actor)
df
Now, what I want to do is to create three new columns (Sex, Status, SexStat) based on the data in the Group and Actor columns.
For example, if Group = A and Actor = 1, then Sex = M, Status = Dom, and SexStat = DomM. If Group = A and Actor = 3, then Sex = F, Status = Med, and SexStat = MedF (and so on).
The numbers do not always align with the same rank/sexes in every group, and with 5500 lines of data, I would love it if there was a way to not do this manually! Any help would be much appreciated.

You can create conditions for Sex and Status with case_when(), then paste them together to create SexStat:
library(dplyr)
Group <- c("A", "A", "A", "A", "A", "B", "B", "B","B", "B")
Actor <- c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3)
df <- data.frame(Group,Actor)
df
df %>%
mutate(
Sex = case_when(
Group == "A" & Actor == 1 ~ "M",
Group == "A" & Actor == 3 ~ "F",
TRUE ~ ""
),
Status = case_when(
Group == "A" & Actor == 1 ~ "Dom",
Group == "A" & Actor == 3 ~ "Med",
TRUE ~ ""
),
SexStat = paste0(Status,Sex)
)
Group Actor Sex Status SexStat
1 A 1 M Dom DomM
2 A 3 F Med MedF
3 A 6
4 A 4
5 A 1 M Dom DomM
6 B 2
7 B 2
8 B 6
9 B 4
10 B 3

We may do this by joining with a key/value dataset:
library(dplyr)
library(tidyr)
library(stringr)
keydat <- tibble(Group = "A", Actor = c(1, 3), Sex = c("M", "F"), Status = c("Dom", "Med"))
df %>%
left_join(keydat, by = c("Group", "Actor")) %>%
mutate(across(c(Sex, Status), ~ replace_na(.x, "")),
SexStat = str_c(Status, Sex))
Output:
Group Actor Sex Status SexStat
1 A 1 M Dom DomM
2 A 3 F Med MedF
3 A 6
4 A 4
5 A 1 M Dom DomM
6 B 2
7 B 2
8 B 6
9 B 4
10 B 3
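For the full data set, the same join idea extends to a lookup table that lists every Group/Actor pair once. The following is a minimal sketch under stated assumptions: only the two Group A rows of the lookup table come from the question, and the Group B rows are invented placeholders that would need to be replaced with the real mapping.

```r
library(dplyr)

df <- data.frame(
  Group = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Actor = c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3)
)

# Lookup table: one row per Group/Actor pair.
# The Group B rows are hypothetical placeholders, not from the question.
key <- data.frame(
  Group  = c("A", "A", "B", "B"),
  Actor  = c(1, 3, 2, 6),
  Sex    = c("M", "F", "F", "M"),
  Status = c("Dom", "Med", "Dom", "Sub")
)

out <- df %>%
  left_join(key, by = c("Group", "Actor")) %>%
  # Pairs missing from the lookup table stay NA instead of becoming "NANA"
  mutate(SexStat = if_else(is.na(Sex), NA_character_, paste0(Status, Sex)))
```

Leaving unmatched pairs as NA makes gaps in the lookup table easy to spot with something like filter(out, is.na(Sex)).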

Related

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
Another strategy is to join the non-reference rows to the reference rows. (This works even when the number of conditions differs between groups.)
library(dplyr)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df adopted
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of above syntax
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill cond afterwards if desired.
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
# look up each drug's result at the reference condition (cond == 0)
mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data :
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
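The reference value can also be attached per drug with a grouped mutate. This is a minimal sketch, not the original answer's code, and it assumes that every drug has exactly one row with cond == 0.

```r
library(dplyr)

df <- data.frame(
  drug   = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  cond   = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
  result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)

df_wider0 <- df %>%
  group_by(drug) %>%
  # assumes exactly one cond == 0 row per drug; it is recycled within the group
  mutate(result0 = result[cond == 0]) %>%
  ungroup() %>%
  filter(cond != 0)
```

A drug with no reference row would make the mutate fail, which is arguably a useful signal that the data are incomplete.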

2-group heterogeneity index

I have a dataset with two distinct groups (A and B) belonging to 3 different categories (1, 2, 3):
library(tidyverse)
set.seed(100)
df <- tibble(Group = sample(c(1, 2, 3), 20, replace = T),Company = sample(c('A', 'B'), 20, replace = T))
I want to come up with a metric that characterizes group composition across the timespan.
Thus far, I have used an index based on Shannon's Index, which gives a measure of heterogeneity varying between 0 and 1, with 1 being perfectly heterogeneous (equal representation of each group) and 0 being completely homogeneous (only 1 group is represented):
df %>%
group_by(Group, Company) %>%
summarise(n=n()) %>%
mutate(p = n / sum(n)) %>%
mutate(Shannon = -(p*log2(p) + (1-p)))
Yielding:
Group Company n p Shannon
<dbl> <chr> <int> <dbl> <dbl>
1 A 2 0.6666667 0.05664167
1 B 1 0.3333333 -0.13834583
2 A 4 0.5000000 0.00000000
2 B 4 0.5000000 0.00000000
3 A 1 0.1111111 -0.53667500
3 B 8 0.8888889 0.03993333
However, I am looking for an index in the range [-1, +1], where the index yields -1 when only group A is present at a time point, +1 when only group B is present, and 0 when both are equally represented.
How can I create such an index? I have looked at measures such as Moran's I as inspiration, but they do not seem to suit the need.
A simple solution might be to calculate the mean.
I transformed Company into value with A = -1 and B = 1 and calculated the mean by Group.
The result will be an index for each Group, with -1 when Company has just "A"s or 1 when there are just "B"s.
Data
df <- structure(list(Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1,
2, 2, 3, 2, 2, 1, 1, 3), Company = c("A", "A", "A", "A", "B",
"B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "A", "B", "B",
"A", "B")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
Code
df %>%
mutate(value = ifelse(Company == "A", -1, 1)) %>%
group_by(Group) %>%
summarise(index = mean(value))
Output
# A tibble: 3 x 2
Group index
<dbl> <dbl>
1 1 0.333
2 2 -0.429
3 3 0.429
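The mean of the -1/+1 codes is the same as (n_B - n_A) / (n_A + n_B), the difference in counts divided by the group size. A short sketch of that equivalent form, using the same data (the column names n_A and n_B are introduced here for illustration):

```r
library(dplyr)

df <- data.frame(
  Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1, 2, 2, 3, 2, 2, 1, 1, 3),
  Company = c("A", "A", "A", "A", "B", "B", "B", "B", "A", "B", "B", "B",
              "A", "A", "B", "A", "B", "B", "A", "B")
)

index_by_group <- df %>%
  group_by(Group) %>%
  summarise(
    n_A = sum(Company == "A"),          # count of A rows in the group
    n_B = sum(Company == "B"),          # count of B rows in the group
    index = (n_B - n_A) / (n_A + n_B)   # -1 = only A, +1 = only B, 0 = balanced
  )
```

Writing the index this way keeps the counts visible, which can help when a group is small and the index is therefore noisy.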

Logic for filtering dependent on two columns [duplicate]

I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group) -> groups
df4 %>%
filter(group %in% groups)
or if you want to combine the two steps:
df4 %>%
# parentheses are required: %in% and %>% have the same precedence
filter(group %in% (df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group)))
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do this with dplyr by combining group_by(), filter(), and any(). any() returns TRUE if the condition holds for at least one row, and group_by() makes filter() evaluate it separately within each group.
Follow these steps:
First, pipe the data to group_by() to group by your group variable.
Then pipe to filter() and use any() to keep the groups in which any pop equals 3.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
group_by(group) %>%
filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))
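Another way to express "keep every row whose group has a match" is a filtering join. This is a sketch using dplyr's semi_join(), which keeps the rows of the first table that have a match in the second without adding any columns:

```r
library(dplyr)

df4 <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = 5),
  pop   = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5)
)

# Rows with pop == 3 identify the groups of interest;
# semi_join() then keeps all rows of df4 belonging to those groups.
threes <- df4 %>%
  semi_join(filter(df4, pop == 3), by = "group")
```

Unlike the group_by() version, the result is not a grouped tibble, so no ungroup() is needed afterwards.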

How to remove duplicates based on two columns with a condition?

I'd like to remove some duplicates, but not all of them. I'll explain after showing the data I'm working with.
Here is a sample of my dataframe:
df <- data.frame("S" = c("A", "B", "C", "D", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "004", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B", "B"),
"Q" = c(1, 2, 3, 4, 5, 6),
"U" = c(rep("A", 6)),
"P" = c(2, 3, 4, 4, 7, 7),
stringsAsFactors = FALSE)
And now some code I'm applying to this dataframe:
df$P <- round(as.double(df$P), digits = 2)
df <- df[order(df$R, df$P),]
df <- df %>%
group_by(R) %>%
mutate(price = P - min(P)) %>%
ungroup()
df$Ecart <- df$price * as.double(df$Q)
df <- df %>%
group_by(R) %>%
mutate(EcartTotal = cumsum(Ecart)) %>%
ungroup()
The result I'm expecting :
result <- data.frame("S" = c("A", "B", "C", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B"),
"Q" = c(1, 2, 3, 5, 6),
"U" = c(rep("A", 5)),
"P" = c(2, 3, 4, 7, 7),
"price" = c(0, 1, 0, 3, 3),
"Ecart" = c(0, 2, 0, 15, 18),
"EcartTotal" = c(NA, 2, NA, NA, 33),
stringsAsFactors = FALSE)
So to obtain this I'd like to remove the duplicates of the column R only if their price is equal to 0.
I'd also like to replace the value of EcartTotal by NA if they are not equal to the max value for each R
We can filter based on the condition and then replace the value of 'EcartTotal' to NA after grouping by 'R'
library(dplyr)
df %>%
filter(!(duplicated(R) & price == 0)) %>%
group_by(R) %>%
mutate(EcartTotal = replace(EcartTotal, EcartTotal != max(EcartTotal), NA))
# A tibble: 5 x 12
# Groups: R [2]
# S D N R RF Des Q U P price Ecart EcartTotal
# <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 01/01/2019 001 ABC1 ABC1F A 1 A 2 0 0 NA
#2 B 01/02/2019 002 ABC1 ABC1F A 2 A 3 1 2 2
#3 C 01/03/2019 003 ABC2 ABC2F B 3 A 4 0 0 NA
#4 E 01/05/2019 005 ABC2 ABC2F B 5 A 7 3 15 NA
#5 F 01/06/2019 006 ABC2 ABC2F B 6 A 7 3 18 33
Or the filter after the group_by step
df %>%
group_by(R) %>%
filter(!(row_number() > 1 & price == 0)) %>%
mutate(EcartTotal = EcartTotal * NA^(EcartTotal != max(EcartTotal)))
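The NA^(...) expression in the second version is compact but not obvious. It relies on NA^TRUE being NA and NA^FALSE being 1, so multiplying by it blanks out exactly the elements where the condition is TRUE. A small base R sketch on a made-up vector x shows that it matches the replace() form:

```r
x <- c(0, 15, 33)   # illustrative values, e.g. one group's cumulative totals

# NA^TRUE is NA, NA^FALSE is NA^0 = 1
mask <- NA^(x != max(x))

keep_max_trick   <- x * mask                      # NA where x is not the maximum
keep_max_replace <- replace(x, x != max(x), NA)   # same result, more explicit
```

Both forms give the same vector; replace() is usually easier to read for anyone unfamiliar with the arithmetic trick.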

dplyr mutate new dynamic variables with case_when

I'm aware of similar questions here and here, but I haven't been able to figure out the right solution for my specific situation. Some of what I'm finding are solutions which use mutate_, etc but I understand these are now obsolete. I'm new to dynamic usages of dplyr.
I have a dataframe which includes some variables with two different prefixes, alpha and beta:
df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
which.to.use = c("alpha", "alpha", "beta", "beta"))
I want to create new variables with the prefix "chosen." which are copies of either the "alpha" or "beta" columns depending on which is named for that row in the "which.to.use" column. The desired output would be:
desired.df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
which.to.use = c("alpha", "alpha", "beta", "beta"),
chosen.num = c(1, 3, 6, 8),
chosen.char = c("a", "c", "f", "h"))
My failed attempt:
varnames <- c("num", "char")
df %<>%
mutate(as.name(paste0("chosen.", varnames)) := case_when(
which.to.use == "alpha" ~ paste0("alpha.", varnames),
which.to.use == "beta" ~ paste0("beta.", varnames)
))
I'd prefer a pure dplyr solution, and even better would be one which could be included in a longer pipe modifying the df (i.e. no need to stop to create "varnames"). Thanks for your help.
Using some fun rlang stuff & purrr:
library(rlang)
library(purrr)
library(dplyr)
df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
which.to.use = c("alpha", "alpha", "beta", "beta"),
stringsAsFactors = F)
c("num", "char") %>%
map(~ mutate(df, !!sym(paste0("chosen.", .x)) :=
case_when(
which.to.use == "alpha" ~ !!sym(paste0("alpha.", .x)),
which.to.use == "beta" ~ !!sym(paste0("beta.", .x))
))) %>%
reduce(full_join)
Result:
alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1 1 a 2 b alpha 1 a
2 3 c 4 d alpha 3 c
3 5 e 6 f beta 6 f
4 7 g 8 h beta 8 h
Without reduce(full_join):
c("num", "char") %>%
map_dfc(~ mutate(df, !!sym(paste0("chosen.", .x)) :=
case_when(
which.to.use == "alpha" ~ !!sym(paste0("alpha.", .x)),
which.to.use == "beta" ~ !!sym(paste0("beta.", .x))
))) %>%
select(-ends_with("1"))
alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1 1 a 2 b alpha 1 a
2 3 c 4 d alpha 3 c
3 5 e 6 f beta 6 f
4 7 g 8 h beta 8 h
Explanation:
(Note: I do not fully or even kind of get rlang. Maybe others can give a better explanation ;).)
Using paste0 by itself produces a string, when we need a bare name for mutate to know it is referring to a variable name.
If we wrap paste0 in sym, it evaluates to a bare name:
> x <- varnames[1]
> sym(paste0("alpha.", x))
alpha.num
But mutate does not know to evaluate it and instead reads it as a symbol:
> typeof(sym(paste0("alpha.", x)))
[1] "symbol"
The "bang bang" !! operator evaluates the sym function. Compare:
> expr(mutate(df, var = sym(paste0("alpha.", x))))
mutate(df, var = sym(paste0("alpha.", x)))
> expr(mutate(df, var = !!sym(paste0("alpha.", x))))
mutate(df, var = alpha.num)
So with !!sym we can use paste0 to dynamically refer to variable names in dplyr.
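For comparison, recent dplyr versions (1.0 and later) allow the same idea without sym(): the .data pronoun looks up a column by string, and a glue-style name on the left of := builds the new column name. This is a sketch under the assumption that there are only the two prefixes, alpha and beta; the loop variable v and the result name out are introduced here for illustration.

```r
library(dplyr)

df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = FALSE)

out <- df
for (v in c("num", "char")) {
  out <- out %>%
    # "chosen.{v}" is interpolated into the new column name;
    # .data[[...]] fetches the existing column named by the string
    mutate("chosen.{v}" := if_else(
      which.to.use == "alpha",
      .data[[paste0("alpha.", v)]],
      .data[[paste0("beta.", v)]]
    ))
}
```

if_else() requires both branches to have the same type, which holds here because each alpha column matches the type of its beta counterpart.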
This is a nest()/map() strategy that should be pretty fast. It stays in the tidyverse, but doesn't go into rlang land.
library(tidyverse)
df %>%
nest(-which.to.use) %>%
mutate(new_data = map2(data, which.to.use,
~ select(..1, matches(..2)) %>%
rename_all(funs(gsub(".*\\.", "choosen.", .) )))) %>%
unnest()
which.to.use alpha.num alpha.char beta.num beta.char choosen.num choosen.char
1 alpha 1 a 2 b 1 a
2 alpha 3 c 4 d 3 c
3 beta 5 e 6 f 6 f
4 beta 7 g 8 h 8 h
It grabs all columns that are not which.to.use, not just num and char. But that seems like what you (I) would want IRL. You could add a select(matches('var1|var2|etc')) line before you call nest() if you wanted to pull only specific variables.
EDIT:
My original suggestion of using select() to drop unneeded columns would require a join to bring them back later. If you instead adjust the nest parameters, you can achieve this on only certain columns.
I added new bool columns here, but they will be ignored for the "choosen" selection:
new_df <- data.frame(alpha.num = c(1, 3, 5, 7),
alpha.char = c("a", "c", "e", "g"),
alpha.bool = FALSE,
beta.num = c(2, 4, 6, 8),
beta.char = c("b", "d", "f", "h"),
beta.bool = TRUE,
which.to.use = c("alpha", "alpha", "beta", "beta"),
stringsAsFactors = FALSE)
new_df %>%
nest(matches("num|char")) %>% # only columns that match this pattern get nested, allows you to save others
mutate(new_data = map2(data, which.to.use,
~ select(..1, matches(..2)) %>%
rename_all(funs(gsub(".*\\.", "choosen.", .) )))) %>%
unnest()
alpha.bool beta.bool which.to.use alpha.num alpha.char beta.num beta.char choosen.num choosen.char
1 FALSE TRUE alpha 1 a 2 b 1 a
2 FALSE TRUE alpha 3 c 4 d 3 c
3 FALSE TRUE beta 5 e 6 f 6 f
4 FALSE TRUE beta 7 g 8 h 8 h
A base R approach uses apply() with MARGIN = 1: for each row, we select the columns whose names match the value in which.to.use and take their values. Note that apply() coerces each row to character, so chosen.num is returned as a character column.
df[c("chosen.num", "chosen.char")] <-
t(apply(df, 1, function(x) x[grepl(x["which.to.use"], names(df))]))
df
# alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
#1 1 a 2 b alpha 1 a
#2 3 c 4 d alpha 3 c
#3 5 e 6 f beta 6 f
#4 7 g 8 h beta 8 h
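If the column types matter, a column-wise base R variant avoids the row-wise coercion to character. This is a sketch rather than part of the original answer, and it assumes only the two prefixes alpha and beta occur in which.to.use:

```r
df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = FALSE)

for (v in c("num", "char")) {
  # work one column pair at a time, so numeric stays numeric
  df[[paste0("chosen.", v)]] <- ifelse(
    df$which.to.use == "alpha",
    df[[paste0("alpha.", v)]],
    df[[paste0("beta.", v)]]
  )
}
```

Here chosen.num remains numeric and chosen.char remains character.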
You can also try a gather/spread approach
df %>%
rownames_to_column() %>%
gather(k,v,-which.to.use,-rowname) %>%
separate(k,into = c("k1", "k2"), sep="[.]") %>%
filter(which.to.use == k1) %>%
mutate(k1="chosen") %>%
unite(k, k1, k2,sep=".") %>%
spread(k,v) %>%
select(.,chosen.num, chosen.char) %>%
bind_cols(df, .)
alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1 1 a 2 b alpha 1 a
2 3 c 4 d alpha 3 c
3 5 e 6 f beta 6 f
4 7 g 8 h beta 8 h
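gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). The following sketch shows roughly the same reshaping idea with pivot_longer() and its ".value" sentinel, which keeps num and char as separate typed columns. The helper column names row_id and prefix are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = FALSE)

chosen <- df %>%
  mutate(row_id = row_number()) %>%
  # "alpha.num" is split into prefix = "alpha" and a value column "num"
  pivot_longer(
    cols = -c(which.to.use, row_id),
    names_to = c("prefix", ".value"),
    names_sep = "\\."
  ) %>%
  filter(prefix == which.to.use) %>%   # keep the row for the chosen prefix
  arrange(row_id) %>%
  transmute(chosen.num = num, chosen.char = char)

out <- bind_cols(df, chosen)
```

Because the values are never pooled into a single column, chosen.num stays numeric instead of being converted to character.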
