What is the best way to dplyr::select the first occurrence of a variable with a certain prefix (and all other variables without that prefix)? Put another way, how can I drop all variables with that prefix except the first occurrence?
library(tidyverse)
hiris <- head(iris)
#given this data.frame:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
# rowname Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa
# 5 5 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa
Now let's say I want to drop all variables with the prefix Sepal.Length except the first one (Sepal.Length.x). I could do:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname") %>%
dplyr::select(-Sepal.Length.y, -Sepal.Length)
This works fine, but I want something flexible that works with an arbitrary number of variables with the prefix Sepal.Length, e.g.:
lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
I could do something like this:
df <- lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
name_drop <- (df %>% select(matches("Sepal.Length")) %>% names())[-1]
df %>%
select(-all_of(name_drop))
but I'm looking to do it within a single pipe and more efficiently. Any suggestions?
Thanks
I like this explanation of the problem:
drop all variables with that prefix except the first occurrence.
select(iris, !starts_with("Sepal")[-1])
# Sepal.Length Petal.Length Petal.Width Species
# 1 5.1 1.4 0.2 setosa
# 2 4.9 1.4 0.2 setosa
# ...
starts_with("Sepal") of course returns all columns that start with "Sepal", we can use [-1] to remove the first match, and ! to drop any remaining matches.
It does seem a little like black magic - if we were doing this in base R, the [-1] would be appropriate if we used which() to get column indices, and the ! would be appropriate if we didn't use which() and had a logical vector, but somehow the tidyselect functionality makes it work!
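To tie this back to the joined data frame from the question, here is a minimal sketch. It assumes a tidyselect version that supports `!` inside select(), and the base R cross-check at the end is only an illustration, not part of the answer above.

```r
library(dplyr)
library(purrr)
library(tibble)

hiris <- head(iris)

# Join three copies, as in the question; full_join() adds the .x/.y suffixes
df <- list(hiris, hiris, hiris) %>%
  map(rownames_to_column) %>%
  reduce(full_join, by = "rowname")

# Keep the first "Sepal.Length*" column and drop every later match
kept <- df %>% select(!starts_with("Sepal.Length")[-1])

# Base R cross-check: positions of all matches, minus the first one
drop_idx <- which(startsWith(names(df), "Sepal.Length"))[-1]
kept_base <- df[, setdiff(seq_along(df), drop_idx)]

identical(names(kept), names(kept_base))
```

Because the selection happens inside select(), the same line works no matter how many copies are joined.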
My question is very similar to this one, but I would prefer to have a tidyverse approach.
I have a dataset with several columns and I want to split it columnwise (not rowwise!), but keep a list of common columns in every dataset. To illustrate this, I will use the iris dataset, and let's say that Species is the common column that I want to keep.
It would be really easy to do it using just these simple operations:
iris1 <- iris[,c("Species", "Sepal.Width")]
iris2 <- iris[,c("Species", "Sepal.Length")]
iris3 <- iris[,c("Species", "Petal.Width")]
iris4 <- iris[,c("Species", "Petal.Length")]
So I want to achieve the same output as that, but in a tidyverse style and usable in a pipeline without breaking it.
One approach could be to make a function that extracts from iris the Species and the column (number or name) of your choice, then map those column numbers into your function.
library(dplyr)
library(purrr)
make_df <- function(col) { iris %>% select(Species, {{ col }}) }
c(2,1,4,3) %>% map(make_df)
or as one line:
c(2,1,4,3) %>% map(~ iris %>% select(Species, all_of(.x)))
This will output a list with four elements, each of which is a data frame like you describe. For many workflows that will be safer and more convenient than creating four free-floating data frames in the global environment.
c(2,1,4,3) %>% map(make_df) %>% map(head)
[[1]]
Species Sepal.Width
1 setosa 3.5
2 setosa 3.0
3 setosa 3.2
4 setosa 3.1
5 setosa 3.6
6 setosa 3.9
[[2]]
Species Sepal.Length
1 setosa 5.1
2 setosa 4.9
3 setosa 4.7
4 setosa 4.6
5 setosa 5.0
6 setosa 5.4
[[3]]
Species Petal.Width
1 setosa 0.2
2 setosa 0.2
3 setosa 0.2
4 setosa 0.2
5 setosa 0.2
6 setosa 0.4
[[4]]
Species Petal.Length
1 setosa 1.4
2 setosa 1.4
3 setosa 1.3
4 setosa 1.5
5 setosa 1.4
6 setosa 1.7
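The same idea also works with column names instead of positions, which can be easier to read and gives a named list. This is a small sketch; the names `cols` and `iris_parts` are made up for the illustration.

```r
library(dplyr)
library(purrr)

# Columns to split out; Species is kept in every piece
cols <- c("Sepal.Width", "Sepal.Length", "Petal.Width", "Petal.Length")

iris_parts <- cols %>%
  set_names() %>%                           # name each element after itself
  map(~ select(iris, Species, all_of(.x)))  # one two-column data frame per name

names(iris_parts)
head(iris_parts$Sepal.Width)
```

Each piece can then be pulled out by name, e.g. iris_parts$Petal.Length, without creating separate objects in the global environment.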
I want to use the str_detect function, passing a variable as the first argument. In theory, it could look something like this:
# create the variable
var = names(mtcars)[1]
mtcars %>%
mutate(
new_var = case_when(str_detect(var, "^2"), "two", "other")
)
Now I'm not sure how to insert the variable var correctly into the str_detect function. I guess some tidy evaluation is necessary, but I'm not sure.
Using mtcars as an example for string manipulation is not very helpful, so I'm switching over to iris. Also, your case_when specification was wrong, so I'm using if_else for this example.
You can use !!sym(var):
library(tidyverse)
var <- "Species"
iris %>%
mutate(
new_var = if_else(str_detect(!!sym(var), "set"), "two", "other")
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa two
2 4.9 3.0 1.4 0.2 setosa two
3 4.7 3.2 1.3 0.2 setosa two
4 4.6 3.1 1.5 0.2 setosa two
5 5.0 3.6 1.4 0.2 setosa two
6 5.4 3.9 1.7 0.4 setosa two
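As an alternative sketch, the .data pronoun can look up a column from a string without !!sym(), and case_when works once each condition is written as a two-sided formula. The label "starts_with_set" is invented for this illustration, and as.character() is added only to make the factor-to-string conversion explicit.

```r
library(dplyr)
library(stringr)

var <- "Species"

res <- iris %>%
  mutate(
    new_var = case_when(
      # .data[[var]] fetches the column whose name is stored in var
      str_detect(as.character(.data[[var]]), "^set") ~ "starts_with_set",
      TRUE ~ "other"
    )
  )

table(res$new_var)
```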
I have a dataset like iris. Any help would be appreciated.
iris %>% head %>% mutate(sum = .[[1]] + .[[2]]) # works
iris %>% head %>% mutate(max = max(.[1], .[2])) # doesn't work
Expected result: the row-wise maximum of the 1st and 2nd columns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species max
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.7
4 4.6 3.1 1.5 0.2 setosa 4.6
5 5.0 3.6 1.4 0.2 setosa 5.0
6 5.4 3.9 1.7 0.4 setosa 5.4
Many thanks in advance.
We need an elementwise max, which can be achieved with pmax:
iris %>%
head %>%
mutate(max= pmax(.[[1]] , .[[2]]) )
The issue with max is that its usage is
max(..., na.rm = FALSE)
Here, the ... signifies
numeric or character arguments
So max takes the single largest value across all the columns passed to it, rather than the elementwise max of the columns.
+ is a different kind of function and is always elementwise. sum, which is the closer counterpart to max, behaves the same way as max: it collapses all of its arguments into a single value.
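The difference is easy to see outside of mutate. This is a small base R sketch on the first six rows of iris:

```r
hiris <- head(iris)

# max() and sum() collapse all of their arguments into one value
max(hiris[[1]], hiris[[2]])   # 5.4
sum(hiris[[1]], hiris[[2]])   # a single number, not one per row

# pmax() and + work elementwise and return one value per row
pmax(hiris[[1]], hiris[[2]])  # 5.1 4.9 4.7 4.6 5.0 5.4
hiris[[1]] + hiris[[2]]
```

Inside mutate, the single value returned by max would be recycled to every row, which is why the column does not show a per-row maximum.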
I am learning tidyr and doing a small exercise to transform iris data set from wide to long.
The original data set:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
The resulting data set I want:
Species Part Length Width
1 setosa Petal 1.4 0.2
2 setosa Petal 1.4 0.2
3 setosa Petal 1.3 0.2
4 setosa Petal 1.5 0.2
5 setosa Petal 1.4 0.2
6 setosa Petal 1.7 0.4
The code I wrote for manipulating data set:
iris_re <- iris[,c(5,1,2,3,4)]
iris.wide <- iris_re %>%
gather(key = "flower_att", value = "measurement",
-Species) %>%
separate(flower_att, into = c("Part","Method")) %>%
spread(Method,measurement)
But the final line of spread() gives me an error:
Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 400 rows:
I did not expect this to happen and I am still struggling with it. Thank you!
We can use pivot_longer from tidyr, which can produce several value columns at once through the ".value" sentinel in names_to:
library(dplyr)
library(tidyr)
iris_re %>%
pivot_longer(cols = -Species, names_to = c("Part", ".value"), names_sep= "[.]") %>%
head
# Species Part Length Width
#1 setosa Sepal 5.1 3.5
#2 setosa Petal 1.4 0.2
#3 setosa Sepal 4.9 3.0
#4 setosa Petal 1.4 0.2
#5 setosa Sepal 4.7 3.2
#6 setosa Petal 1.3 0.2
The error in spread occurs when the key columns do not uniquely identify each output row, i.e. several rows share the same combination of keys. With pivot_wider, this is now replaced with a warning: it returns a list column if there are duplicates, which we can then unnest. Another way is to create a sequence column, grouped by the identifier columns that have duplicates, to make a unique row identifier, i.e.
iris_re %>%
gather(key = "flower_att", value = "measurement",
-Species) %>%
separate(flower_att, into = c("Part","Method")) %>%
group_by(Species, Part, Method) %>%
mutate(rn = row_number()) %>%
ungroup %>%
spread(Method,measurement)
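The same row-identifier idea carries over to pivot_wider. The sketch below adds an id column before reshaping (the name `id` is an arbitrary choice), so every flower keeps its own key through the round trip:

```r
library(dplyr)
library(tidyr)

iris_long_wide <- iris %>%
  mutate(id = row_number()) %>%          # one id per original row
  pivot_longer(
    cols = -c(Species, id),
    names_to = c("Part", "Method"),
    names_sep = "[.]",
    values_to = "measurement"
  ) %>%
  # Species, id and Part now identify each output row uniquely
  pivot_wider(names_from = Method, values_from = measurement)

head(iris_long_wide)
```

With the id in place there are no duplicated keys, so no list columns are produced. The result has two rows per flower, one for Sepal and one for Petal.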
I created a dummy function to get the lag of one variable and I want to use it with other tidyverse functions. It works after I call mutate but not after calling group_by. It throws the following error:
Error in mutate_impl(.data, dots) :
Not compatible with STRSXP: [type=NULL].
Here is a reprex:
#create a function to lag a selected variable
lag_func <- function(df, x) {
mutate(df, lag = lag(df[,x]))
}
#works
iris %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
#doesn't work
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
Any idea what the error means and/or how to fix it?
One way to pass a column name as an argument to a tidyverse function is to convert it to a quosure using enquo(). See this code:
lag_func <- function(df, x) {
x <- enquo(x)
mutate(df, lag = lag(!!x)) # !! unquotes x, so the captured column is used rather than the symbol x
}
}
Now let's try our function:
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func(Petal.Length)
# A tibble: 150 x 7
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species lead lag
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.4 NA
2 4.9 3 1.4 0.2 setosa 1.3 1.4
3 4.7 3.2 1.3 0.2 setosa 1.5 1.4
4 4.6 3.1 1.5 0.2 setosa 1.4 1.3
5 5 3.6 1.4 0.2 setosa 1.7 1.5
6 5.4 3.9 1.7 0.4 setosa 1.4 1.4
7 4.6 3.4 1.4 0.3 setosa 1.5 1.7
8 5 3.4 1.5 0.2 setosa 1.4 1.4
9 4.4 2.9 1.4 0.2 setosa 1.5 1.5
10 4.9 3.1 1.5 0.1 setosa 1.5 1.4
# ... with 140 more rows
For more info on how to use tidyverse functions within your custom functions see here
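As a further sketch, newer versions of rlang and dplyr offer two shorthands for the same pattern: the embrace operator {{ }} for unquoted column names, and the .data pronoun for column names passed as strings. The function names `lag_func2` and `lag_func_chr` are invented for this illustration.

```r
library(dplyr)

# Unquoted column name: {{ x }} does the enquo() and !! steps in one go
lag_func2 <- function(df, x) {
  mutate(df, lag = dplyr::lag({{ x }}))
}

# Column name as a string: .data[[x]] looks the column up by name
lag_func_chr <- function(df, x) {
  mutate(df, lag = dplyr::lag(.data[[x]]))
}

res1 <- iris %>% group_by(Species) %>% lag_func2(Petal.Length)
res2 <- iris %>% group_by(Species) %>% lag_func_chr("Petal.Length")

identical(res1$lag, res2$lag)
```

Both versions respect the grouping, so the lag restarts with NA at the first row of each species.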