Using dplyr to conditionally replace values in a column - r

I have an example data set with a column that reads somewhat like this:
Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee
What I'd like to do is recode this into just two factors - "Candy" and "Non-Candy". I can do this with Python/pandas, but can't seem to figure out a dplyr-based solution. Thank you!

In dplyr (replace() itself is base R):
dat %>%
mutate(var = replace(var, var != "Candy", "Not Candy"))
Significantly faster than the ifelse approaches.
Code to create the initial data frame:
library(dplyr)
dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

Assuming your data frame is dat and your column is var:
dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
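For completeness, forcats offers the same two-level collapse in one call; a sketch (not part of the original answer), assuming var is, or is converted to, a factor:
library(forcats)
# collapse every level except "Candy" into a single "Non-Candy" level
dat$candy.flag <- fct_other(factor(dat$var), keep = "Candy", other_level = "Non-Candy")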

Another solution with dplyr using case_when:
dat %>%
mutate(var = case_when(var == 'Candy' ~ 'Candy',
TRUE ~ 'Non-Candy'))
The syntax for case_when is condition ~ replacement value; see ?case_when.
Probably less efficient than the solution using replace, but an advantage is that multiple replacements can be performed in a single, still nicely readable command - for example, recoding to three levels:
dat %>%
mutate(var = case_when(var == 'Candy' ~ 'Candy',
var == 'Water' ~ 'Water',
TRUE ~ 'Neither-Water-Nor-Candy'))

No need for dplyr. Assuming var is stored as a factor already:
non_c <- setdiff(levels(dat$var), "Candy")
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)
See ?levels.
This is much more efficient than the ifelse approach, which is bound to be slow:
library(microbenchmark)
set.seed(01239)
# resample data (var kept as a factor so the levels() approach applies)
smp <- data.frame(var = factor(sample(dat$var, 1e6, TRUE)))
timings <- replicate(50, {
# copy data to facilitate reuse
cop <- smp
t0 <- get_nanotime()
levs <- setdiff(levels(cop$var), "Candy")
levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
t1 <- get_nanotime() - t0
cop <- smp
t0 <- get_nanotime()
cop = cop %>%
mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
t2 <- get_nanotime() - t0
cop <- smp
t0 <- get_nanotime()
cop$var <-
factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
t3 <- get_nanotime() - t0
c(levels = t1, dplyr = t2, direct = t3)
})
x <- apply(timings, 1, median)
x[2:3]/x[1]
# dplyr direct
# 8.894303 4.962791
That is, the levels() approach is about 9 times faster than the dplyr/ifelse one, and about 5 times faster than the direct factor() call.
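For reference, the same comparison could also be run with microbenchmark() as the harness instead of the hand-rolled get_nanotime() loop; a minimal sketch on the smp data from above:
microbenchmark(
  levels = {
    cop <- smp
    levs <- setdiff(levels(cop$var), "Candy")
    levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
  },
  dplyr = {
    cop <- smp %>%
      mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
  },
  direct = {
    cop <- smp
    cop$var <- factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
  },
  times = 50
)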

I didn't benchmark this, but at least in some cases with more than one condition, a combination of mutate and a named lookup list seems to provide an easy solution:
# assuming that all sweet things fall in one category
dat <- data.frame(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))
conditions <- list("Candy" = TRUE, "Sanitizer" = FALSE, "Water" = FALSE,
"Cake" = TRUE, "Ice Cream" = TRUE, "Gum" = TRUE, "Coffee" = FALSE)
dat %>% mutate(sweet = unlist(conditions[as.character(var)]))
(The as.character() matters: subsetting the list with a factor would index by the underlying integer codes rather than by name, and unlist() turns the one-element list entries into a plain logical column.)

When you only need two values, a simple ifelse() is prettiest, I think.
Furthermore, nested ifelse() calls can simulate the same situation as the case_when solution proposed by PhJ (whose readability I do like, though)!
dat %>%
mutate(
var = ifelse(var == "Candy", "Candy", "Non-Candy")
)

Related

Using a loop to create columns based on two data frames

I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1,10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1, 10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid = patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2 which basically generates a binary (1/0) column for each comorbidity based on whether the earliest "eventdate" (diagnosis) came before "date of session" OR "date of referral" (if "date of session" is NA) for each patient.
df_comorb <- df2 %>%
  filter(comorbidity == "hypertension") %>%
  group_by(patid) %>%
  filter(eventdate == min(eventdate))
df1 <- left_join(df1, df_comorb, by = "patid")
df1 <- df1 %>%
mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
replace_na(list(hypertension_baseline = 0)) %>%
select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity but I don't know how to approach writing one for this problem.
Any help would be appreciated.
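One way to avoid repeating that block 27 times is to loop over the distinct comorbidities. A minimal, untested sketch building on the code above (the <comorbidity>_baseline column names are illustrative):
library(dplyr)
library(tidyr)
for (com in unique(df2$comorbidity)) {
  # earliest diagnosis date per patient for this comorbidity
  earliest <- df2 %>%
    filter(comorbidity == com) %>%
    group_by(patid) %>%
    summarise(eventdate = min(eventdate), .groups = "drop")
  flag <- paste0(com, "_baseline")
  df1 <- df1 %>%
    left_join(earliest, by = "patid") %>%
    mutate(!!flag := ifelse(eventdate < date_of_session |
                              eventdate < date_of_referral, 1, 0)) %>%
    replace_na(setNames(list(0), flag)) %>%
    select(-eventdate)
}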

resource efficient join and filter method

I have the following condensed data set:
tbl1 <- data.frame(Name = c(rep("A",3), rep("B",2), rep("C",3)), Dat = c(1,1,2,1,1,3,4,4),
Var1 = sample(1:8,8), Var2 = sample(1:8,8))
tbl2 <- data.frame(Name = c("A","A","B","C","C"), Dat = c(1,2,1,3,4), x = c("a","b","b","b","a"))
I need to keep from tbl1 all rows whose (Name, Dat) combination satisfies the condition on x found in tbl2. This is my current solution:
tbl11 <- tbl1 %>% mutate(key = paste(Name, Dat, sep = "_"))
tbl2 <- tbl2 %>% mutate(key = paste(Name, Dat, sep = "_"))
tbl3 <- left_join(tbl11, tbl2)
tbl4 <- tbl3 %>% filter(x == "a")
Unfortunately I run into resource issues. For small tables it works, but I think there are more efficient ways that don't require storing the intermediate steps. Your help is much appreciated.
You can subset the data before joining :
tbl4 <- merge(tbl1, subset(tbl2, x == 'a'), by = c('Name', 'Dat'))
Thanks for sharing your ideas. Just for completeness, I have tested the suggestions and came up with a correct and more efficient way:
tbl3 <- inner_join(filter(tbl2, x == 'a'), tbl1, by = c('Name', 'Dat'))
inner_join() is significantly faster than merge(). And the order of the inputs is important, of course.
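For completeness, if the x column itself isn't needed in the result, a filtering join avoids carrying tbl2's columns along at all; a sketch with the same tables:
library(dplyr)
# keep only the rows of tbl1 whose (Name, Dat) pair has x == "a" in tbl2
tbl4 <- semi_join(tbl1, filter(tbl2, x == "a"), by = c("Name", "Dat"))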

Create a new column using mutate_at() in R

I'm trying to make some modifications to the following data frame:
df <- data.frame(
zgen = c("100003446", "100001749","100002644","100001755"),
Name_mat = c("EVEROLIMUS 10 MG CM", "GALSULFASA 5MG/5ML FAM", "IDURSULFASE 2MG/ML SOL. P/INFUSION FAM","IMIGLUCERASA 400U POL. LIOF. FAM"),
details= c("CM", "FAM", "SOL. P/INFUSION FAM","NA")
)
And I'm using mutate_at() from the dplyr package to create a new column called "Type". That column's value depends on which of a list of character patterns appears in the columns "Name_mat" and "details". The code is:
df <- df %>% mutate_at(vars(one_of("Name_mat ","details")),
funs(case_when( "FAM|FRA" == TRUE ~ "FA",
"CM|COMPRIMIDO" == TRUE~ "COM",
"SOL"== TRUE~"SOL",
"CP|CAPSULA"== TRUE~"CAP",
TRUE ~ "bad_mat")))
This is my first time using mutate_at() and I don't know how to create the new "Type" column in my data frame df. In the end I need something like:
ZGEN Name_mat details Type
1 100003446 EVEROLIMUS 10 MG CM CM COM
2 100001749 GALSULFASA 5MG/5ML FAM FAM FA
3 100002644 IDURSULFASE 2MG/ML SOL. P/INFUSION FAM SOL. P/INFUSION FAM FA
4 100001755 IMIGLUCERASA 400U POL. LIOF. FAM NA FA
I appreciate any help or any other point of view about how to do this.
Thanks!
Try it this way:
library(tidyverse)
library(stringr)
df %>% mutate(Type = case_when(
  str_detect(Name_mat, pattern = "FAM") | str_detect(details, "FRA") ~ "FA",
  str_detect(Name_mat, pattern = "CM") | str_detect(details, "COMPRIMIDO") ~ "COM",
  str_detect(Name_mat, pattern = "SOL") ~ "SOL",
  str_detect(Name_mat, pattern = "CP") | str_detect(details, "CAPSULA") ~ "CAP",
  TRUE ~ "bad_mat"))
We can also use
library(dplyr)
library(purrr)
library(stringr)
pat <- "\\b(FAM|FRA|CM|COMPRIMIDO|SOL|CP|CAPSULA)\\b"
nm1 <- setNames(c("FA", "FA", "COM", "COM", "SOL", "CAP", "CAP"),
c("FAM", "FRA", "CM", "COMPRIMIDO", "SOL", "CP", "CAPSULA"))
df %>%
select(Name_mat, details) %>%
map(str_extract_all, pattern = pat) %>%
transpose %>%
map_chr( ~ nm1[flatten_chr(.x)][1] ) %>%
bind_cols(df, Type = .)

dplyr group_by loop through different columns

I have the following data (recreated in the first answer below).
I would like to create three different data frames using the group_by and summarise dplyr functions. These would be df_Sex, df_AgeGroup and df_Type. For each of these columns I would like to perform the following function:
df_Sex = df %>% group_by(Sex) %>% summarise(Total = sum(Number))
Is there a way of using apply or lapply to pass the names of each of these three columns (Sex, AgeGrouping and Type) to create these 3 data frames?
This will work, but it will create a list of data frames as your output:
### Create your data first
df <- data.frame(ID = rep(10250,6), Sex = c(rep("Female", 3), rep("Male",3)),
Population = c(rep(3499, 3), rep(1163,3)), AgeGrouping =c(rep("0-14", 3), rep("15-25",3)) ,
Type = c("Type1", "Type1","Type2", "Type1","Type1","Type2"), Number = c(260,100,0,122,56,0))
gr <- list("Sex", "AgeGrouping", "Type")
# across(all_of(i)) groups by the column whose name is stored in i
# (the old group_by(df, .dots = i) idiom is deprecated)
df_list <- lapply(gr, function(i) df %>% group_by(across(all_of(i))) %>% summarise(Total = sum(Number)))
Here's a way to do it:
f <- function(x) {
df %>%
group_by(!!x) %>%
summarize(Total = sum(Number))
}
lapply(c(quo(Sex), quo(AgeGrouping), quo(Type)), f)
There might be a better way to do it; I haven't looked that much into tidyeval. I personally would prefer this:
library(data.table)
DT <- as.data.table(df)
lapply(c("Sex", "AgeGrouping", "Type"),
function(x) DT[, .(Total = sum(Number)), by = x])
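To end up with three separate, named data frames (as in the question) rather than an anonymous list, the list can be named and unpacked; a small sketch, assuming the lapply result was assigned to df_list as in the first answer:
names(df_list) <- paste0("df_", unlist(gr))  # df_Sex, df_AgeGrouping, df_Type
list2env(df_list, envir = .GlobalEnv)        # creates the three data frames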

Why do I get different results using SE or NSE dplyr functions

Hi, I get different results from dplyr functions when I use standard evaluation through the lazyeval package.
Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1 and id2 and subset the rows with the max(datetime) for each group.
library(dplyr)
# random datetime generation function by Dirk Eddelbuettel
# http://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/13") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et, st, units = "sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
set.seed(42)
# Creating 230000 ids couples
ids <- data_frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"),
id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Repeating randomly the ids[1:2000, ] to create groups
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
datas <- mutate(ids, datetime = rand.datetime(25e4))
When I use the NSE way, I get 230000 rows:
df1 <-
datas %>%
group_by(id1, id2) %>%
filter(datetime == max(datetime))
nrow(df1) #230000
But when I use the SE way, I get only 229977 rows:
ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"
df2 <-
datas %>%
group_by_(ids) %>%
filter_(.dots = lazyeval::interp(~var == fun(var),
var = as.name(filterVar),
fun = as.name(filterFun)))
nrow(df2) #229977
My two pieces of code are equivalent, right?
Why do I experience different results? Thanks.
You'll need to specify the .dots argument in group_by_ when giving a vector of column names.
df2 <- datas %>%
group_by_(.dots = ids) %>%
filter_(.dots = lazyeval::interp(~var == fun(var),
var = as.name(filterVar),
fun = as.name(filterFun)))
nrow(df2)
[1] 230000
It looks like group_by_ might take the first column name from the vector as the only grouping variable when you don't specify the .dots argument. You can check this by grouping on id1 only.
df1 <- datas %>%
group_by(id1) %>%
filter(datetime == max(datetime))
nrow(df1)
[1] 229977
(If you group just on id2 the number of rows is 229976).
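A present-day note: group_by_/filter_ and lazyeval have since been deprecated in dplyr. The same parameterized call can be written with across() and the .data pronoun; a sketch, assuming the same datas, ids, filterVar and filterFun objects as above:
df2 <- datas %>%
  group_by(across(all_of(ids))) %>%
  filter(.data[[filterVar]] == match.fun(filterFun)(.data[[filterVar]]))
nrow(df2)
# [1] 230000 (matches the NSE result above)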
