I've got a database query running from R that can return multiple columns filled with NA. I need some of these columns further down the line, but one specific column can be dropped if all of its values are NA.
I usually use purrr::discard(~all(is.na(.))) to drop the columns that are all NA, but this data frame can contain several all-NA columns and I only want to remove one of them, so I'm struggling to make this specific to that column in a tidyverse solution.
I've currently got this workaround:
if(sum(is.na(Orders$Originator)) == nrow(Orders)) {
Orders <- Orders %>%
select(-Originator)
}
But it would improve the readability if I can have this in a tidyverse solution. Hope someone can be of help!
Thanks!
The canonical tidyverse way to address this problem would be to make use of a predicate function used within select(where(...)) and combine this with selection by variable name.
First we could write a custom predicate function to use in where which selects only columns that contain only NAs.
# custom predicate function
all_na <- function(x) {
all(is.na(x))
}
We can use this function together with a boolean expression saying we don't want to select y if (read AND &) it is all_na:
library(dplyr)
df <- data.frame(
x = c(1,2,NA),
y = NA,
z = c(3,4,5)
)
df %>%
select(!(y & where(all_na)))
#> x z
#> 1 1 3
#> 2 2 4
#> 3 NA 5
To check whether this is really working let's redefine y so that it contains not only NAs and we will see that this time it's not deselected:
df2 <- data.frame(
x = c(1,2,NA),
y = c(1,2,NA),
z = c(3,4,5)
)
df2 %>%
select(!(y & where(all_na)))
#> x y z
#> 1 1 1 3
#> 2 2 2 4
#> 3 NA NA 5
Instead of a custom function we can use a lambda function inside where:
df %>%
select(!(y & where(~ all(is.na(.x)))))
Created on 2021-12-07 by the reprex package (v0.3.0)
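Applied to the question's data, the same selection might look like the sketch below. The Orders data frame here is a made-up stand-in: only Originator comes from the question, while OrderId and Notes are assumptions added for illustration. Notes is also all NA to show that it is left alone.

```r
library(dplyr)

# Hypothetical stand-in for the query result; only Originator is from the question
Orders <- data.frame(
  OrderId = 1:3,
  Originator = NA,
  Notes = NA
)

# Drop Originator only if it is entirely NA; other all-NA columns are kept
Orders_clean <- Orders %>%
  select(!(Originator & where(~ all(is.na(.x)))))

names(Orders_clean)
```

This assumes a dplyr version that provides where() (1.0.0 or later).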
In the larger tidyverse we could also use purrr::lmap_at and select y with the .at argument and then create a lambda function saying if all(is.na(.x)) then use an empty list() (= drop the column) otherwise keep the column .x:
library(purrr)
library(dplyr)
df %>%
lmap_at("y", ~ if(all(is.na(.x))) list() else .x)
#> # A tibble: 3 x 2
#> x z
#> <dbl> <dbl>
#> 1 1 3
#> 2 2 4
#> 3 NA 5
Created on 2021-12-07 by the reprex package (v2.0.1)
Using example data:
df <- data.frame(
x = c(1,2,NA),
y = NA,
z = c(3,4,5)
)
Here column y is the target column to check for all NA values. The if and else are wrapped in curly braces, which stop the pipe from inserting the left-hand side as the first argument of a function call, so the data frame is only available through the . placeholder. Note that the else branch returns the data frame unchanged when the condition is false, which keeps it in the pipe.
library(tidyverse)
df %>%
{ if (all(is.na(.$y))) select(., -y) else . }
Output
x z
1 1 3
2 2 4
3 NA 5
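If the braces hurt readability, the same conditional can be moved into a small helper so the pipeline stays flat. This is only a sketch in base R; drop_if_all_na is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: drop `col` from `df` only when every value in it is NA
drop_if_all_na <- function(df, col) {
  if (col %in% names(df) && all(is.na(df[[col]]))) {
    df[[col]] <- NULL
  }
  df
}

df <- data.frame(x = c(1, 2, NA), y = NA, z = c(3, 4, 5))

res_y <- drop_if_all_na(df, "y")  # y is all NA, so it is dropped
res_x <- drop_if_all_na(df, "x")  # x has non-NA values, so nothing changes
```

Because the data frame is the first argument, the helper also works in a pipe, for example df %>% drop_if_all_na("y").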
It seems you are trying to mix selection by name (i.e. only specific column) and by logical. Both can be done very well separately in tidyverse (use tidy selectors or where), but I am not sure how to combine them!
So here is a dirty solution that doesn't use either:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
x = c(1,2,NA),
y = NA,
y2 = NA,
z = c(3,4,5)
)
df %>%
select(-which(colnames(.)=="y" & sapply(., \(x) all(is.na(x)))))
#> x y2 z
#> 1 1 NA 3
#> 2 2 NA 4
#> 3 NA NA 5
Created on 2021-12-07 by the reprex package (v2.0.1)
I have a tibble, and I want to add columns to it using a character vector. The names of the vector are to be the names of the new columns, and the new columns should be filled with the values of the vector (repeated for each row). This is easy to do with a for loop, but I'm trying to understand how across works and I'm having two problems.
cv <- c("a"="x", "b"="y", "c"="z")
tib <- tibble(c1=1:5)
myf <- function(x) { cv[x]}
tib %>% mutate(across(all_of(names(cv))), myf) ## first problem
# Error: Problem with `mutate()` input `..1`.
# x Can't subset columns that don't exist.
# x Columns `a`, `b`, and `c` don't exist.
# ℹ Input `..1` is `across(all_of(names(cv)))`.
for (x in names(cv)) { ## do it with a for loop
tib[[x]] <- myf(x)
}
tib %>% mutate(across(all_of(names(cv)), myf)) ## second problem
which produces:
# A tibble: 5 x 4
c1 a b c
<int> <chr> <chr> <chr>
1 1 NA NA NA
2 2 NA NA NA
3 3 NA NA NA
4 4 NA NA NA
5 5 NA NA NA
Replacing the last line with tib %>% mutate_at(all_of(names(cv)), myf) produces the same incorrect behavior.
The first problem is that mutate across doesn't seem to like making new columns for some reason I can't understand. The second problem is that across doesn't know what to do with myf. It seems to want some kind of closure that I don't know how to create. (Same with mutate_at.) I've looked briefly at rlang but can't make heads or tails of how to convert a regular function into the appropriate kind of object.
across() only works on columns that already exist in the dataset: it can update those columns or, by specifying .names, create new columns derived from them. Here, one method would be to loop over the names with map, create each column with transmute, and bind the results to the original data:
library(purrr)
library(dplyr)
map_dfc(names(cv), ~ tib %>%
transmute(!! .x := myf(.x))) %>%
bind_cols(tib, .)
-output
# A tibble: 5 x 4
# c1 a b c
# <int> <chr> <chr> <chr>
#1 1 x y z
#2 2 x y z
#3 3 x y z
#4 4 x y z
#5 5 x y z
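A possible shortcut, not taken from the answer above: since every new column is a single constant, the named vector can be spliced into mutate() with the !!! operator, and each value is recycled over all rows. The sketch below reuses cv and tib from the question and stores the result in out, a name introduced here for illustration.

```r
library(dplyr)
library(tibble)

cv <- c("a" = "x", "b" = "y", "c" = "z")
tib <- tibble(c1 = 1:5)

# as.list() turns the named vector into name = value pairs;
# !!! splices them in as if mutate(a = "x", b = "y", c = "z") had been typed
out <- tib %>%
  mutate(!!!as.list(cv))

out
```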
I have a data frame with a large number of string columns.
Each of those columns consists of strings with three parts which I would like split. So in the end the total number of string columns would triple.
When doing that I would additionally like to directly name the new columns by attaching certain predefined strings to their original column name.
As a simplified example
test_frame<-tibble(x=c("a1!","b2#","c3$"), y=c("A1$","G2%", NA))
x y
a1! A1$
b2# G2%
c3$ NA
should become something like
x_letter x_number x_sign y_letter y_number y_sign
a 1 ! A 1 $
b 2 # G 2 %
c 3 $ NA NA NA
The order of the elements within the string is always the same.
The real data frame has over 100 string columns that can all be split into their three parts using a separator. The only exception might be rows where a string is missing.
I've looked into combinations of str_split_fixed(), strsplit() and separate() and apply functions but couldn't figure out how to directly name the columns while also looping over the columns.
What would be a simple approach here?
This should be what you need. It's not the cleanest solution, but it is simple:
library(tidyverse)
test_frame<-tibble(x=c("a1!","b2#","c3$"), y=c("A1$","G2%", NA))
pipe_to_do <- . %>%
str_split_fixed(string = .,pattern = "(?<=.)",n = 3) %>%
as_tibble() %>%
rename(letter = V1,
number = V2,
sign = V3)
xx <- test_frame %>%
summarise(across(everything(),.fns = pipe_to_do))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
names_xx <- names(xx)
combine_names <- function(df,name) {
str_c(name,"_",df)
}
combine_names_func <- function(df,name){
df %>%
rename_with(.fn = ~ combine_names(.x,name))
}
map2(xx,names_xx,combine_names_func) %>%
reduce(bind_cols)
#> # A tibble: 3 x 6
#> x_letter x_number x_sign y_letter y_number y_sign
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 1 ! "A" "1" "$"
#> 2 b 2 # "G" "2" "%"
#> 3 c 3 $ "" "" ""
Created on 2020-08-04 by the reprex package (v0.3.0)
You can use str_extract:
library(stringr)
df <- data.frame(
x_letter = str_extract(test_frame$x,"^[a-z]"),
x_number = str_extract(test_frame$x,"(?<=^[a-z])[0-9]"),
x_sign = str_extract(test_frame$x,".$"),
y_letter = str_extract(test_frame$y,"^[A-Z]"),
y_number = str_extract(test_frame$y,"(?<=^[A-Z])[0-9]"),
y_sign = str_extract(test_frame$y,".$")
)
Result:
df
x_letter x_number x_sign y_letter y_number y_sign
1 a 1 ! A 1 $
2 b 2 # G 2 %
3 c 3 $ <NA> <NA> <NA>
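With over 100 columns, writing out each str_extract call by hand does not scale, so the naming can be generated in a loop instead. The base R sketch below assumes each part is exactly one character, as in the simplified example; split_cols and its suffixes argument are hypothetical names. For real data with a separator, the substr() call would have to be swapped for a split on that separator.

```r
test_frame <- data.frame(
  x = c("a1!", "b2#", "c3$"),
  y = c("A1$", "G2%", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split every column into one-character parts and
# name each new column <original name>_<suffix>
split_cols <- function(df, suffixes = c("letter", "number", "sign")) {
  parts <- lapply(names(df), function(col) {
    # substr() returns NA for missing strings, so NA rows stay NA
    pieces <- lapply(seq_along(suffixes), function(i) substr(df[[col]], i, i))
    names(pieces) <- paste(col, suffixes, sep = "_")
    as.data.frame(pieces, stringsAsFactors = FALSE)
  })
  do.call(cbind, parts)
}

out <- split_cols(test_frame)
out
```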
I am trying to get some unique combinations of two variables.
For each value of x, I would like to keep its unique y value, and drop those x values that have several y values. Several x values may still share the same y value.
For example,
a=data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6)),
and I would like to get the output like:
b=data.frame(x=c(2,4,5),y=c(3,3,6))
I have tried unique(), but it does not help this situation.
Thank you!
First we use unique to omit repeated rows with the same x and y values (keeping only one copy of each). Any repeated x values that are left have different y values, so we want to get rid of them. We use the standard way to remove all copies of any duplicated values as in this R-FAQ.
a=data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6))
b = unique(a)
b = b[!duplicated(b$x) & !duplicated(b$x, fromLast = TRUE), ]
b
# x y
# 3 2 3
# 4 4 3
# 5 5 6
Fans of dplyr would probably do it like this, producing the same result.
library(dplyr)
a %>%
group_by(x) %>%
filter(n_distinct(y) == 1) %>%
distinct
Using dplyr:
library(dplyr)
a <- data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6))
a %>%
distinct() %>%
add_count(x) %>% # adds an implicit group_by(x)
filter(n == 1) %>%
select(-n)
#> # A tibble: 3 x 2
#> # Groups: x [3]
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 4 3
#> 3 5 6
Created on 2018-11-14 by the reprex package (v0.2.1)
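The same idea (count the distinct y values per x, keep the groups where that count is 1) can also be sketched in base R with ave(). The variable name n_y is introduced here for illustration.

```r
a <- data.frame(x = c(1, 1, 2, 4, 5, 5), y = c(2, 3, 3, 3, 6, 6))

# For every row, the number of distinct y values within its x group
n_y <- ave(a$y, a$x, FUN = function(v) length(unique(v)))

# Keep rows whose x has exactly one distinct y, then drop repeated rows
b <- unique(a[n_y == 1, ])
b
```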
I have some data columns that I need to set to NA whenever a corresponding column is >0.
I can do this with mutate and the names of the two columns, but I want a scoped version where I create the name of the corresponding column from the name of the 1st column
(x<-data.frame(x1=(1:4),map.x1=c(0,0,7,0),x2=c(2,2,2,2),map.x2=c(0,7,0,0)))
mutate(x, x1=ifelse(map.x1>0, NA, x1))
mutate_at(x, vars(starts_with("x")), function(v) {
m.name <- paste0("map.", deparse(substitute(v)))
ifelse(get(m.name)>0, NA, v)
})
I can see ifelse() is unhappy because it wants the 1st argument to be an object, and I have given an expression.
I could not find a way. I even wondered if there was some way to avoid the function(v) and use (.) in paste0() or get()
I also am considering reshaping so I can do one mutate. What is the best practice here?
This is a solution without reshaping the data.
library(dplyr)
library(rlang)
custom_mutate <- function(df, v){
v <- enquo(v)
map.v <- paste0("map.", quo_name(v))
df %>%
mutate(UQE(v) := ifelse((!!sym(map.v)) > 0, NA, (!!v))) %>%
pull(UQE(v))
}
mutate_at(x, vars(starts_with("x")), funs(custom_mutate(df = x, v = .)))
# x1 map.x1 x2 map.x2
# 1 1 0 2 0
# 2 2 0 NA 7
# 3 NA 7 2 0
# 4 4 0 2 0
The function in the mutate_at call is only applied to the columns and not to the whole dataframe. Therefore you have to explicitly tell the function where to look for your map.x1 column.
To get the name of the column you're working with, you first need to use enquo to turn v into a quosure. Then you can use quo_name to construct the map.-name. In the following mutate call it is important to tell dplyr that v is a quosure (hence the UQE wrapped around it, which is similar to the !! in front of it in the FALSE part of the ifelse statement).
For the map.x1 column you have to use the sym function from the rlang package to get the bare name (without quotation marks) and then again use !! to tell dplyr to take this as a column name.
I tried to explain my solution without being too technical. For a great explanation of how to program with dplyr, see here: Programming with dplyr
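In more recent dplyr versions (1.0.0 or later, an assumption about the installed version) the name lookup can be done without quosures: inside across(), cur_column() returns the name of the column currently being processed, and the .data pronoun can fetch the matching map. column by name. A minimal sketch on the question's data, with the result stored in out for illustration:

```r
library(dplyr)

x <- data.frame(
  x1 = 1:4, map.x1 = c(0, 0, 7, 0),
  x2 = c(2, 2, 2, 2), map.x2 = c(0, 7, 0, 0)
)

# starts_with("x") matches x1 and x2 only; for each of them, look up the
# companion column "map.<name>" and blank the value where it is > 0
out <- x %>%
  mutate(across(
    starts_with("x"),
    ~ ifelse(.data[[paste0("map.", cur_column())]] > 0, NA, .x)
  ))

out
```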
Here is one way to get the output you want. No need to write a custom function. Reshaping the file should be sufficient.
library(tibble)
library(dplyr)
library(stats)
# creating dataframe with proper names
x <-
tibble::as_data_frame(cbind(
x_1 = c(1:4),
map.x_1 = c(0, 0, 7, 0),
x_2 = c(2, 2, 2, 2),
map.x_2 = c(0, 7, 0, 0)
)) %>%
tibble::rownames_to_column(df = ., var = 'id')
# converting to long format
x_long <- stats::reshape(
as.data.frame(x),
timevar = "level",
varying = dput(as.character(as.vector(names(
x[, base::grep("^x|^map", names(x))]
)))),
direction = "long",
idvar = c("id"),
sep = "_"
)
#> c("x_1", "map.x_1", "x_2", "map.x_2")
# converting the dataframe based on condition
x_long %>%
group_by(.data = ., level) %>%
dplyr::mutate(.data = .,
x = base::ifelse(test = map.x > 0,
yes = NA,
no = x))
#> # A tibble: 8 x 4
#> # Groups: level [2]
#> id level x map.x
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 1.00 1.00 0
#> 2 2 1.00 2.00 0
#> 3 3 1.00 NA 7.00
#> 4 4 1.00 4.00 0
#> 5 1 2.00 2.00 0
#> 6 2 2.00 NA 7.00
#> 7 3 2.00 2.00 0
#> 8 4 2.00 2.00 0
Created on 2018-02-14 by the reprex package (v0.1.1.9000).
First, I am very new to R, and I'm aware that I may be making an obvious mistake. I have searched for an answer, but maybe I'm searching for the wrong thing.
I am trying to apply a function to add a new column to a dataframe based on the contents of that row. But it looks to me like the values in the row are not being handled properly in the mutate function when using rowwise. I've tried to create a toy example to demonstrate my problem.
library(dplyr)
x<-c("A","B")
y<-c(1,2)
df<-data.frame(x,y)
Then I have a function to create a new column called z which adds 1 to y if the value of x is "A" and adds 2 to y if the value of x is "B". Note that I have added print(x) to show what is going on.
calculatez <- function(x,y){
print(x)
if(x == "A"){
return (y+1)
}
else{
return(y+2)
}
}
I then try to use mutate:
df %>%
rowwise() %>%
mutate(z = calculatez(x,y))
and I get the following: 2 has been added to both rows, rather than 1 to the first row, and the "A" and "B" have been passed into the function as 1 and 2.
[1] 1
[1] 2
Source: local data frame [2 x 3]
Groups:
x y z
1 A 1 3
2 B 2 4
If I remove the rowwise() function the "A" and "B" appear to be being passed properly, but clearly I don't get the right result.
df %>%
mutate(z = calculatez(x,y))
[1] A B
Levels: A B
x y z
1 A 1 2
2 B 2 3
Warning message:
In if (x == "A") { :
the condition has length > 1 and only the first element will be used
I can get it to work if I try to do it without writing my own function and then I don't get the error message about the length of the condition. So I don't think I understand properly what rowwise() is doing.
df %>%
mutate(z = ifelse(x=="A",y+1,y+2))
x y z
1 A 1 2
2 B 2 4
But I want to be able to use my own function, because in my real application the condition is more complicated and it will be difficult to read with lots of nested ifelse functions in the mutate function.
I can get round the problem by changing my condition to if(x==1) but that will make my code difficult to understand.
I don't want to waste your time, so sorry if I'm missing something obvious. Any tips on where I'm going wrong?
You could use rowwise with do
df %>%
rowwise() %>%
do(data.frame(., z= calculatez(.$x, .$y)))
gives the output
x y z
#1 A 1 2
#2 B 2 4
Or you could do:
df %>%
group_by(N=row_number()) %>%
mutate(z=calculatez(x,y))%>%
ungroup() %>%
select(-N)
Using a different dataset:
df <- structure(list(x = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), y = c(1, 2, 1, 2, 1)), .Names = c("x",
"y"), row.names = c(NA, -5L), class = "data.frame")
Running the above code gives:
# x y z
#1 A 1 2
#2 A 2 3
#3 B 1 3
#4 B 2 4
#5 B 1 3
If you are using data.table
library(data.table)
setDT(df)[, z := calculatez(x,y), by=seq_len(nrow(df))]
df
# x y z
# 1: A 1 2
# 2: A 2 3
# 3: B 1 3
# 4: B 2 4
# 5: B 1 3
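Another option, sketched here under the assumption that the real condition can be expressed element by element, is to make the custom function itself vectorised so that no row-wise grouping is needed. The warning in the question comes from if(), which only looks at a single value; ifelse() checks every element. calculatez_vec is a hypothetical name for the rewritten function.

```r
# Vectorised variant of calculatez: ifelse() evaluates the condition per element
calculatez_vec <- function(x, y) {
  ifelse(x == "A", y + 1, y + 2)
}

df <- data.frame(
  x = factor(c("A", "A", "B", "B", "B")),
  y = c(1, 2, 1, 2, 1)
)

# Works on whole columns at once, with or without mutate()
df$z <- calculatez_vec(df$x, df$y)
df
```

The function keeps its own name and body, so a more complicated condition can live inside it instead of in a long nested ifelse() in the mutate call.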