R dplyr with multiple columns with same stem name

I have some data columns that I need to set to NA whenever a corresponding column is > 0.
I can do this with mutate() and the names of the two columns, but I want a scoped version where I build the name of the corresponding column from the name of the first column:
(x <- data.frame(x1 = 1:4, map.x1 = c(0, 0, 7, 0), x2 = c(2, 2, 2, 2), map.x2 = c(0, 7, 0, 0)))
mutate(x, x1 = ifelse(map.x1 > 0, NA, x1))
mutate_at(x, vars(starts_with("x")), function(v) {
  m.name <- paste0("map.", deparse(substitute(v)))
  ifelse(get(m.name) > 0, NA, v)
})
I can see ifelse() is unhappy because it wants the first argument to be an object, and I have given it an expression.
I could not find a way around this. I even wondered if there was some way to avoid the function(v) and use (.) inside paste0() or get().
I am also considering reshaping so I can do it in a single mutate. What is the best practice here?

This is a solution without reshaping the data.
library(dplyr)
library(rlang)
custom_mutate <- function(df, v) {
  v <- enquo(v)
  map.v <- paste0("map.", quo_name(v))
  df %>%
    mutate(UQE(v) := ifelse((!!sym(map.v)) > 0, NA, (!!v))) %>%
    pull(UQE(v))
}

mutate_at(x, vars(starts_with("x")), funs(custom_mutate(df = x, v = .)))
#   x1 map.x1 x2 map.x2
# 1  1      0  2      0
# 2  2      0 NA      7
# 3 NA      7  2      0
# 4  4      0  2      0
The function in the mutate_at call is applied only to the columns, not to the whole dataframe, so you have to tell the function explicitly where to look for your map.x1 column.
To get the name of the column you're working with, first use enquo to turn v into a quosure. Then you can use quo_name to construct the map.-name. In the following mutate call it is important to tell dplyr that v is a quosure (hence the UQE wrapped around it, which is similar to the !! in front of it in the FALSE part of the ifelse statement).
For the map.x1 column you have to use the sym function from the rlang package to get the bare name (without quotes) and then again use !! to tell dplyr to treat it as a column name.
I tried to explain my solution without being too technical. For a great explanation of how to program with dplyr see here: Programming with dplyr
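As an aside: on current dplyr (>= 1.0.0) the same idea needs no rlang plumbing, because cur_column() returns the name of the column being mutated. A minimal sketch of that approach (not part of the original answer, using the same x as above):
library(dplyr)

x %>%
  mutate(across(
    c(x1, x2),
    # cur_column() is "x1"/"x2"; get() looks the matching "map." column
    # up in mutate()'s data mask
    ~ ifelse(get(paste0("map.", cur_column())) > 0, NA, .x)
  ))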

Here is one way to get the output you want. No need to write a custom function; reshaping the data should be sufficient.
library(tibble)
library(dplyr)
library(stats)
# creating dataframe with proper names
x <-
  tibble::as_data_frame(cbind(
    x_1 = c(1:4),
    map.x_1 = c(0, 0, 7, 0),
    x_2 = c(2, 2, 2, 2),
    map.x_2 = c(0, 7, 0, 0)
  )) %>%
  tibble::rownames_to_column(df = ., var = 'id')

# converting to long format
x_long <- stats::reshape(
  as.data.frame(x),
  timevar = "level",
  varying = dput(as.character(as.vector(names(
    x[, base::grep("^x|^map", names(x))]
  )))),
  direction = "long",
  idvar = c("id"),
  sep = "_"
)
#> c("x_1", "map.x_1", "x_2", "map.x_2")
# converting the dataframe based on condition
x_long %>%
  group_by(.data = ., level) %>%
  dplyr::mutate(.data = .,
                x = base::ifelse(test = map.x > 0,
                                 yes = NA,
                                 no = x))
#> # A tibble: 8 x 4
#> # Groups:   level [2]
#>   id    level     x map.x
#>   <chr> <dbl> <dbl> <dbl>
#> 1 1      1.00  1.00  0
#> 2 2      1.00  2.00  0
#> 3 3      1.00 NA     7.00
#> 4 4      1.00  4.00  0
#> 5 1      2.00  2.00  0
#> 6 2      2.00 NA     7.00
#> 7 3      2.00  2.00  0
#> 8 4      2.00  2.00  0
Created on 2018-02-14 by the reprex package (v0.1.1.9000).
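On newer tidyverse versions, the wide-to-long step above can be written with tidyr::pivot_longer (tidyr >= 1.0.0) instead of stats::reshape, using the ".value" sentinel to split the x_*/map.x_* pairs; a sketch reusing the renamed x from above:
library(tidyr)
library(dplyr)

x %>%
  pivot_longer(cols = -id,
               names_to = c(".value", "level"),
               names_sep = "_") %>%
  mutate(x = ifelse(map.x > 0, NA, x))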

Related

Pipe result from map_dfr to a subsequent map_dfr to apply custom function to groups of data

I would like to apply a certain function (namely AddLags, defined below) to groups of a dataframe. To achieve this, I am trying to use two consecutive map_dfr calls (piping one into the other) to apply the respective filters. In the last step, I apply the custom function mentioned earlier using map_dfr, to capture the newly calculated output in a new object.
The code I have so far is as follows:
library(dplyr)
library(purrr)
library(lubridate)

# dummy dataset
df <- data.frame(
  date = seq(today(), length.out = 12, by = 'month'),
  dim1 = c('a','a','a','b','b','b','c','c','c','d','d','d'),
  dim2 = c(1,1,1,1,1,1,2,2,2,2,2,2),
  value = 1:12
)

# function to apply
AddLags <- function(df, lags_vector, target_col, date_col){
  temp_lags <- map_dfc(lags_vector,
                       ~ df %>%
                         arrange({{date_col}}) %>%
                         transmute(
                           across(contains(target_col), lag, .x,
                                  .names = '{col}_lag_{ifelse(.x<10,paste0("0",.x),.x)}')
                         )
  )
  return(temp_lags)
}
# prepare for the map_dfr approach
lags_features <- c(1, 2)
dims1 <- df %>% pull(dim1) %>% unique %>% sort
dims2 <- df %>% pull(dim2) %>% unique %>% sort

# what I am struggling with
map_dfr(dims1,
        ~ df %>%
          filter(dim1 == .x) %>%
          map_dfr(dims2,
                  ~ . %>%
                    filter(dim2 == .x) %>%
                    AddLags(lags_features, variable, date)
          )
)
# how the loop version would look
gather_results <- data.frame()
for (d1 in dims1) {
  for (d2 in dims2) {
    tempdata <- df %>% filter(dim1 == d1, dim2 == d2) %>% arrange(date)
    temp <- AddLags(tempdata, lags_features, "value", date)
    gather_results %<>% bind_rows(temp)
  }
}
In essence, I am traversing the different groups (through filtering) and applying the custom function to each, while trying to use map_dfr to consolidate the newly calculated results.
I would like to know how to achieve the above (assuming it is feasible) and what I am missing, since for the time being all I get back is an empty dataframe.
BONUS QUESTION:
As I am writing this, I realize that there has to be a better way of doing this than looping -- for instance using a group_by -- but given the nature of the problem and the fact that the function outputs new data, I am not sure what this would look like (assuming it is feasible to begin with). So any kind of suggestion/alternative/best practice would be much appreciated.
DISCLAIMER:
I am a big noob when it comes to purrr functionality and not much of an experienced dplyr user either, so kindly forgive my ignorance.
Is this the expected output?
library(tidyverse)
library(lubridate)
group_split(df, dim1, dim2) %>%
  map_dfr(~ .x %>% AddLags(1:2, "value", date))
#> # A tibble: 12 × 2
#>    value_lag_01 value_lag_02
#>           <int>        <int>
#>  1           NA           NA
#>  2            1           NA
#>  3            2            1
#>  4           NA           NA
#>  5            4           NA
#>  6            5            4
#>  7           NA           NA
#>  8            7           NA
#>  9            8            7
#> 10           NA           NA
#> 11           10           NA
#> 12           11           10
Data:
# dummy dataset
df <- data.frame(
  date = seq(today(), length.out = 12, by = "month"),
  dim1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
  dim2 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  value = 1:12
)

# function to apply
AddLags <- function(df, lags_vector, target_col, date_col) {
  temp_lags <- map_dfc(
    lags_vector,
    ~ df %>%
      arrange({{ date_col }}) %>%
      transmute(
        across(contains(target_col), lag, .x,
               .names = '{col}_lag_{ifelse(.x<10,paste0("0",.x),.x)}')
      )
  )
  return(temp_lags)
}
Created on 2022-01-13 by the reprex package (v2.0.1)
As #Limey suggested, a possible way would be to use the group_map function:
results_df <- bind_rows(
  df %>%
    group_by(dim1, dim2) %>%
    group_map(~ AddLags(., c(1, 2), 'value', date))
)
And the expected results would be :
   value_lag_01 value_lag_02
          <int>        <int>
 1           NA           NA
 2            1           NA
 3            2            1
 4           NA           NA
 5            4           NA
 6            5            4
 7           NA           NA
 8            7           NA
 9            8            7
10           NA           NA
11           10           NA
12           11           10
However, I personally would go with #jpdugo17's approach.
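A closely related option, if you want the dim1/dim2 keys kept next to the lagged columns (group_map drops them), is group_modify (dplyr >= 0.8.1); a sketch, assuming the same AddLags as above:
df %>%
  group_by(dim1, dim2) %>%
  # group_modify() requires a data-frame return value, which AddLags provides,
  # and binds the grouping keys back onto the result
  group_modify(~ AddLags(.x, c(1, 2), "value", date)) %>%
  ungroup()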

Delete specific column if all values are NA in R

I've got a database query running from R which can return multiple columns filled with NA. Some of these columns I need further down the line, but one specific column can be dropped if all of its values are NA.
I usually use purrr::discard(~ all(is.na(.))) to drop the columns that are all NA, but since this dataframe can contain several all-NA columns and I only want to remove one specific one, I'm struggling to make this targeted in a tidyverse solution.
I've currently got this workaround:
if (sum(is.na(Orders$Originator)) == nrow(Orders)) {
  Orders <- Orders %>%
    select(-Originator)
}
But it would improve readability to have this in a tidyverse solution. Hope someone can be of help!
Thanks!
The canonical tidyverse way to address this problem is to use a predicate function inside select(where(...)) and combine it with selection by variable name.
First we write a custom predicate function for where that matches only columns consisting entirely of NAs.
# custom predicate function
all_na <- function(x) {
  all(is.na(x))
}
We can use this function together with a boolean expression that says: don't select y if (read AND, &) it is all_na:
library(dplyr)

df <- data.frame(
  x = c(1, 2, NA),
  y = NA,
  z = c(3, 4, 5)
)

df %>%
  select(!(y & where(all_na)))
#>    x z
#> 1  1 3
#> 2  2 4
#> 3 NA 5
To check that this is really working, let's redefine y so that it no longer contains only NAs; this time it is not deselected:
df2 <- data.frame(
  x = c(1, 2, NA),
  y = c(1, 2, NA),
  z = c(3, 4, 5)
)

df2 %>%
  select(!(y & where(all_na)))
#>    x  y z
#> 1  1  1 3
#> 2  2  2 4
#> 3 NA NA 5
Instead of a custom function we can use a lambda function inside where:
df %>%
  select(!(y & where(~ all(is.na(.x)))))
Created on 2021-12-07 by the reprex package (v0.3.0)
In the larger tidyverse we could also use purrr::lmap_at: select y with the .at argument, then supply a lambda saying that if all(is.na(.x)) the column becomes an empty list() (i.e. is dropped), and otherwise the column .x is kept:
library(purrr)
library(dplyr)

df %>%
  lmap_at("y", ~ if (all(is.na(.x))) list() else .x)
#> # A tibble: 3 x 2
#>       x     z
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2     4
#> 3    NA     5
Created on 2021-12-07 by the reprex package (v2.0.1)
Using example data:
df <- data.frame(
  x = c(1, 2, NA),
  y = NA,
  z = c(3, 4, 5)
)
Here column y is the target column to check with all(is.na()). The if and else are contained in curly braces; the braces stop the pipe from passing the data frame as the first argument of the first function inside, and the else keeps the data frame flowing through the pipe when the condition is false.
library(tidyverse)
df %>%
  { if (all(is.na(.$y))) select(., -y) else . }
Output
   x z
1  1 3
2  2 4
3 NA 5
It seems you are trying to mix selection by name (i.e. one specific column) and selection by a logical condition. Each can be done very well on its own in the tidyverse (tidy selectors, or where), but I am not sure how to combine them!
So here is a dirty solution that uses neither:
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  x = c(1, 2, NA),
  y = NA,
  y2 = NA,
  z = c(3, 4, 5)
)

df %>%
  select(-which(colnames(.) == "y" & sapply(., \(x) all(is.na(x)))))
#>    x y2 z
#> 1  1 NA 3
#> 2  2 NA 4
#> 3 NA NA 5
Created on 2021-12-07 by the reprex package (v2.0.1)

findInterval by group with dplyr [duplicate]

This question already has answers here:
How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame (11 answers)
Closed 1 year ago.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
library(dplyr)
library(tidyr)

set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
             val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data, as in this example, the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
                            purrr::map_dfr(1:3, function(gr) {
                              c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
                            })) %>%
  nest(bp = -gr) %>%
  mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error: Problem with `mutate()` input `lvl`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
                          ~ findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient: with my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any idea?
You can do this in a single group_by + mutate step:
library(dplyr)

df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val,
                                    c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup()
#       gr    val breakpoints
#    <int>  <dbl>       <int>
#  1     1  0.440           1
#  2     1  0.770           2
#  3     1  2.56            4
#  4     1  1.07            3
#  5     1  1.13            3
#  6     1  2.72            4
#  7     1  1.46            4
#  8     1 -0.265           1
#  9     1  0.313           1
# 10     1  0.554           2
# … with 20 more rows
findInterval is applied for each gr separately.
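Since the breakpoints here are exactly the within-group quartiles, dplyr::ntile() offers a near-equivalent shortcut worth knowing (note it buckets by rank into equal-sized groups, so results can differ from findInterval at ties and quantile boundaries); a sketch:
df %>%
  group_by(gr) %>%
  mutate(quartile = ntile(val, 4)) %>%  # 1..4, one quartile label per row
  ungroup()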

Why is the datatype of a column `<I<dbl>>` and not just `<dbl>` after applying the identity function?

I have a case where I need to apply a dynamically selected function onto a column of a tibble. In some cases, I don't want the values to change at all -- then I select the identity function I().
After applying I() the datatype of the column changes from <dbl> to <I<dbl>>. Why is that? Why is it not just double again?
library(tidyverse)
library(magrittr)  # for %<>%

df <- tibble(x = (1:3 * pi))
print(df)
# A tibble: 3 x 1
#       x
#   <dbl>
# 1  3.14
# 2  6.28
# 3  9.42

df %<>% mutate(x = I(x))
print(df)
# A tibble: 3 x 1
#          x
#   <I<dbl>>   <-- Why <I...> and not <dbl>?
# 1     3.14
# 2     6.28
# 3     9.42
How can I just get `<dbl>`?
I() is not the identity function, technically (that would be identity). I() inhibits interpretation/conversion, saying that the component should be used "as is". Further, I(...) returns an object of class "AsIs", which is, and should be, recognized as something distinct from its non-I(...) counterpart. As for the effect of this class ... I don't know of any (though I don't use them regularly, so I might be missing something).
And you can still operate on it; it's just classed differently.
dput(1:3)
# 1:3
dput(I(1:3))
# structure(1:3, class = "AsIs")

tibble(x = (1:3 * pi)) %>%
  mutate(x = I(x)) %>%
  mutate(y = x + 1)
# # A tibble: 3 x 2
#          x        y
#   <I<dbl>> <I<dbl>>
# 1     3.14     4.14
# 2     6.28     7.28
# 3     9.42    10.4
though that new column is also "AsIs".
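As for getting plain `<dbl>` back: stripping the attributes afterwards is enough, e.g. with as.vector() (or unclass()), which drops the "AsIs" class from the atomic column. A minimal sketch:
library(dplyr)

df %>%
  mutate(x = as.vector(x))  # x prints as <dbl> again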

Create column of a tibble (or data frame) that contains a list from a long-format tibble

I have objects that have varying numbers of events at varying times. This is currently stored in a long format (using tibbles from library(tidyverse)) :
timing_tbl <- tibble(ID = c(101, 101, 101, 102, 102, 103, 103, 103, 103),
                     event_time = c(0, 4, 8, 0, 6, 0, 4, 9, 12))
The real data has thousands of objects, with up to 50 or so events, so I want to make this process as efficient as possible.
I would like to convert this to a pseudo-wide format, where the first column is the patient ID, and the second column is a list of the event times for that object. I can do that where the second column is a column of tibbles in the following way
tmp <- lapply(unique(timing_tbl$ID),
              function(x) timing_tbl[timing_tbl$ID == x, "event_time"])
timing_tbl2 <- tibble(unique(timing_tbl$ID), tmp)

> timing_tbl2[1, 2]
# A tibble: 1 x 1
#   tmp
#   <list>
# 1 <tibble [3 × 1]>

> timing_tbl2[[1, 2]]
# A tibble: 3 x 1
#   event_time
#        <dbl>
# 1       0
# 2       4.00
# 3       8.00
I would prefer to store these objects as lists, as I then want to find the “distance” between each pair of objects using the following function, and I worry that extracting the vector from the list adds unnecessary processing, slowing down the calculation.
lap_exp2 <- function(x, y, tau) {
  exp(-abs(x - y) / tau)
}

distance_lap2 <- function(vec1, vec2, tau) {
  ## vec1 is the first vector of event times
  ## vec2 is the second vector of event times
  ## tau is the decay parameter
  0.5 * (sum(outer(vec1, vec1, FUN = lap_exp2, tau = tau)) +
           sum(outer(vec2, vec2, FUN = lap_exp2, tau = tau))) -
    sum(outer(vec1, vec2, FUN = lap_exp2, tau = tau))
}

distance_lap2(timing_tbl2[[1, 2]]$event_time, timing_tbl2[[2, 2]]$event_time, 2)
# [1] 0.8995764
If I try extracting the list instead of the tibble using [[
tmp <- lapply(unique(timing_tbl$ID),
              function(x) timing_tbl[[timing_tbl$ID == x, "event_time"]])
I get the following error, which makes sense:
Error in col[[i, exact = exact]] : attempt to select more than one element in vectorIndex
Is there a reasonably simple way I can extract the column from the long tibble as a list and store it in the new tibble? Is this even the right way to go about this?
I've found using tidyr::nest a good way to generate the 'list columns' I think you may be after (especially for stuffing in time-series-ish data). Hope the following helps!
library(dplyr)
library(tidyr)
library(purrr)

timing_tbl <- tibble(ID = c(101, 101, 101, 102, 102, 103, 103, 103, 103),
                     event_time = c(0, 4, 8, 0, 6, 0, 4, 9, 12))

ID_times <-
  timing_tbl %>%
  group_by(ID) %>%
  nest(.key = "times_df") %>%
  split(.$ID) %>%
  map(~ .$times_df %>% unlist(use.names = F))
# > ID_times
# $`101`
# [1] 0 4 8
# $`102`
# [1] 0 6
# $`103`
# [1]  0  4  9 12

dists_long <-
  names(ID_times) %>%
  expand.grid(IDx = ., IDy = .) %>%
  filter(IDx != IDy) %>%
  rowwise() %>%
  mutate(dist = distance_lap2(vec1 = ID_times[[IDx]], vec2 = ID_times[[IDy]], tau = 2))
# # A tibble: 6 x 3
#   IDx   IDy    dist
#   <fct> <fct> <dbl>
# 1 102   101   0.900
# 2 103   101   0.981
# 3 101   102   0.900
# 4 103   102   1.68
# 5 101   103   0.981
# 6 102   103   1.68
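For comparison, base R builds the same named list of event-time vectors in one step with split(), which could then feed distance_lap2 directly:
# returns list(`101` = c(0, 4, 8), `102` = c(0, 6), `103` = c(0, 4, 9, 12))
ID_times <- split(timing_tbl$event_time, timing_tbl$ID)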
