I have some data such as:
data(iris)
I want to rename the columns such that Species is the Y variable and all other variables are the predictors.
What I currently have doesn't give me the result I am looking for.
iris %>%
select(Species, everything()) %>% # move the Y variable to the "front"
rename(Y = 1) %>%
rename_at(vars(2:ncol(.)), ~ paste("X", seq(2:ncol(.)), sep = ""))
Expected output would be colnames:
Y, X1, X2, X3, X4, X5... XN
What went wrong
The mistake in your code is that it assumes the second . (in the anonymous function) is a tibble, when in fact it is a character vector of the selected column names. Hence ncol(.) is inappropriate (it returns NULL for a plain vector) and should be length(.). There is also no need for seq(): 1:length(.) already numbers the predictors from 1, as in the output you requested. In the end, you would have been fine with:
iris %>%
select(Species, everything()) %>%
rename(Y = 1) %>%
rename_at(vars(2:ncol(.)), ~ paste("X", 1:length(.), sep = ""))
The other answers provide alternative ways of expressing this operation. A possibly cleaner version, using rename_with() and stringr's str_c(), would be:
iris %>%
select(Species, everything()) %>%
rename(Y = 1) %>%
rename_with(~ str_c("X", seq_along(.)), -1)
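To see why ncol(.) fails inside the renaming function, it may help to look at what that function actually receives. The sketch below is only an illustration: nms is a hypothetical stand-in for the column names passed to the function, and paste0() replaces str_c() so that only dplyr is needed.

```r
library(dplyr)

# Stand-in for what the renaming function receives: a character vector of names.
nms <- names(iris)[1:4]
ncol(nms)     # NULL, because a character vector has no dimensions
length(nms)   # 4

# Same renaming with rename_with(); .x is that character vector.
renamed <- iris %>%
  select(Species, everything()) %>%
  rename(Y = 1) %>%
  rename_with(~ paste0("X", seq_along(.x)), .cols = -1)

names(renamed)
#> [1] "Y"  "X1" "X2" "X3" "X4"
```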
I'm rearranging your steps to avoid having to do any subsetting in creating the names. Instead, give the first column the name X0 knowing you're going to change it to Y.
library(dplyr)
iris %>%
select(Species, everything()) %>%
setNames(paste0("X", seq_along(.) - 1)) %>%
rename(Y = 1) %>%
head()
#> Y X1 X2 X3 X4
#> 1 setosa 5.1 3.5 1.4 0.2
#> 2 setosa 4.9 3.0 1.4 0.2
#> 3 setosa 4.7 3.2 1.3 0.2
#> 4 setosa 4.6 3.1 1.5 0.2
#> 5 setosa 5.0 3.6 1.4 0.2
#> 6 setosa 5.4 3.9 1.7 0.4
You can set the colnames directly instead of using the sometimes finicky rename functions:
iris %>%
select(Species, everything()) %>% # move the Y variable to the "front"
`colnames<-`(c('Y', paste0("X", seq_len(ncol(.) - 1)))) %>%
head
Y X1 X2 X3 X4
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
This question explains why `colnames<-` works as a function in the pipe:
use %>% with replacement functions like colnames()<-
Solution with base R functions:
colnames(iris) <- c("X1", "X2", "X3", "X4", "Y") # rename columns
iris[,c(5,1,2,3,4)] # reorder
# Y X1 X2 X3 X4
# 1 setosa 5.1 3.5 1.4 0.2
# 2 setosa 4.9 3.0 1.4 0.2
# 3 setosa 4.7 3.2 1.3 0.2
# 4 setosa 4.6 3.1 1.5 0.2
# 5 setosa 5.0 3.6 1.4 0.2
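The base R version above hard-codes five names and overwrites iris. A possible generalisation, sketched here under the assumption that the response column is identified by name (y_col and df are illustrative names, not from the original answer), works on a copy and handles any number of predictors:

```r
# Work on a copy so the built-in dataset stays untouched.
df <- iris
y_col <- "Species"  # assumed name of the response column

# Move the response to the front, then rename by position.
df <- df[, c(y_col, setdiff(names(df), y_col))]
names(df) <- c("Y", paste0("X", seq_len(ncol(df) - 1)))

names(df)
#> [1] "Y"  "X1" "X2" "X3" "X4"
```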
Related
I'm currently trying to automate a data task that requires taking in a list of column names in string format, then summing those columns (rowwise). For example, suppose there is some list as follows:
> list
[1] "colname1" "colname2" "colname3"
How would I go about passing in this list to some function like sum() in tidyverse? That is, I would like to run something like the following:
df <- df %>%
rowwise %>%
mutate(new_var = sum(list))
Any suggestions would be greatly, greatly appreciated. Thanks.
You could use rowSums here. For example:
library(dplyr)
mycols <- colnames(iris)[3:4]
mycols
[1] "Petal.Length" "Petal.Width"
Then:
iris %>%
mutate(new_var = rowSums(.[, mycols])) %>%
head()
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1
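You can pass the vector of column names to c_across(), wrapped in all_of() so that it is treated as an external character vector rather than as a column name.
library(dplyr)
df <- df %>% rowwise() %>% mutate(new_var = sum(c_across(all_of(list))))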
df
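For a self-contained comparison of the two approaches, here is a minimal sketch on made-up data. The data frame, the vector name cols, and the use of across() inside rowSums() are assumptions for illustration and do not come from the answers above.

```r
library(dplyr)

# Hypothetical data with the column names from the question.
df <- data.frame(
  colname1 = 1:3,
  colname2 = c(10, 20, 30),
  colname3 = c(0.5, 0.5, 0.5)
)
cols <- c("colname1", "colname2", "colname3")

# Vectorised: rowSums() over the columns selected by name.
by_rowsums <- df %>%
  mutate(new_var = rowSums(across(all_of(cols))))

# Row by row: c_across() with the same selection.
by_c_across <- df %>%
  rowwise() %>%
  mutate(new_var = sum(c_across(all_of(cols)))) %>%
  ungroup()

by_rowsums$new_var
#> [1] 11.5 22.5 33.5
```

On larger data the rowSums() form is usually the faster of the two, since it avoids evaluating the expression once per row.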
What is the best way to dplyr::select the first occurrence of a variable with a certain prefix (and all other variables without that prefix)? Put another way, how do I drop all variables with that prefix except the first occurrence?
library(tidyverse)
hiris <- head(iris)
#given this data.frame:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
# rowname Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa
# 5 5 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa
Now let's say I want to drop all variables with prefix Sepal.Length except the first one (Sepal.Length.x). I could do:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname") %>%
dplyr::select(-Sepal.Length.y, -Sepal.Length)
This works fine, but I want something flexible that works with an arbitrary number of variables with prefix Sepal.Length, e.g.:
lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
I could do something like this:
df <- lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
name_drop <- (df %>% select(matches("Sepal.Length")) %>% names())[-1]
df %>%
select(-name_drop)
But I'm looking to do it in a single pipe and more efficiently. Any suggestions?
thanks
I like this explanation of the problem:
drop all variables with that prefix except the first occurrence.
select(iris, !starts_with("Sepal")[-1])
# Sepal.Length Petal.Length Petal.Width Species
# 1 5.1 1.4 0.2 setosa
# 2 4.9 1.4 0.2 setosa
# ...
starts_with("Sepal") of course returns all columns that start with "Sepal", we can use [-1] to remove the first match, and ! to drop any remaining matches.
It does seem a little like black magic - if we were doing this in base R, the [-1] would be appropriate if we used which() to get column indices, and the ! would be appropriate if we didn't use which() and had a logical vector, but somehow the tidyselect functionality makes it work!
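To make the "black magic" a little less opaque, the sketch below places the idiom next to a more explicit version that builds the list of names to drop. The small data frame is a hypothetical stand-in for the joined data in the question, and the sketch assumes a tidyselect version in which the idiom behaves as shown in the answer.

```r
library(dplyr)

# Hypothetical stand-in for the joined data: three columns share the prefix "a".
df <- data.frame(id = 1:2, a.x = 1:2, a.y = 3:4, a = 5:6, b = 7:8)

# tidyselect idiom: positions of the matches, minus the first, then negated.
kept_idiom <- df %>% select(!starts_with("a")[-1])

# Explicit version: collect the matching names, skip the first, drop the rest.
drop_names <- grep("^a", names(df), value = TRUE)[-1]
kept_explicit <- df %>% select(!all_of(drop_names))

names(kept_explicit)
#> [1] "id"  "a.x" "b"
```

Both forms fit inside a single pipe, so either could replace the name_drop step in the question.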
I have a dataset like iris. Any help will be appreciated.
iris %>% head %>% mutate(sum = .[[1]] + .[[2]]) # works
iris %>% head %>% mutate(max = max(.[1], .[2])) # doesn't work
Expected answer: the row-wise max of the 1st and 2nd columns, i.e. max(1st column, 2nd column) for each row.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species max
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.7
4 4.6 3.1 1.5 0.2 setosa 4.6
5 5.0 3.6 1.4 0.2 setosa 5.0
6 5.4 3.9 1.7 0.4 setosa 5.4
Many thanks in advance.
We need an elementwise max, and this can be achieved with pmax:
iris %>%
head %>%
mutate(max= pmax(.[[1]] , .[[2]]) )
The issue with max is that its usage is
max(..., na.rm = FALSE)
Here, the ... signifies
numeric or character arguments
So it takes the max of all the values in the columns passed to it and returns a single number, rather than the elementwise max of the columns.
The + operator is a different function and is always elementwise. If we use sum instead (the natural counterpart to check against max), it behaves the same way as max and collapses everything into one value.
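The difference between the collapsing and the elementwise functions is easy to check on two short vectors. This is a minimal sketch with made-up values; a and b are illustrative names.

```r
a <- c(5.1, 4.9, 4.7)
b <- c(3.5, 6.0, 3.2)

# max() and sum() pool all their arguments into one result.
max(a, b)    # 6
sum(a, b)    # 27.4

# pmax() and + work position by position and keep the length.
pmax(a, b)   # 5.1 6.0 4.7
a + b        # 8.6 10.9 7.9
```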
I have some data which looks like:
data(iris)
d <- iris %>%
select(Species, everything()) %>%
rename(Y = 1) %>%
rename_at(vars(-c(1)), ~str_c("X", seq_along(.)))
Data:
Y X1 X2 X3 X4
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
I add a random variable:
d$noise <- rnorm(nrow(d))
I am trying to extract just the Y, X1, X2... XN variables (dynamically). What I currently have is:
d %>%
select("Y", cat(paste0("X", seq_along(2:ncol(.)), collapse = ", ")))
This doesn't work since it takes into account the noise column and doesn't work even without the noise column.
So I am trying to create a new data frame which just extracts the Y, X1, X2...XN columns.
dplyr provides two select helper functions that you could use: contains() for literal strings or matches() for regular expressions.
In this case you could do
d %>%
select("Y", contains("X"))
or
d %>%
select("Y", matches("X\\d+"))
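The first one works in the example you provided but would fail if you have other variables that contain any "X" character. The second is more robust in that it only captures variables whose names contain "X" followed by one or more digits. Note that both helpers ignore case by default and that the pattern is unanchored, so matches("^X\\d+$", ignore.case = FALSE) is the strictest form.
We can also use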
d %>%
select(Y, starts_with('X'))
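The helpers differ mainly in how strictly they match. A minimal sketch on made-up data follows; the extra column max_x1 is hypothetical and is included only to show a false positive.

```r
library(dplyr)

# Hypothetical data: max_x1 is a column we do NOT want to pick up.
d <- data.frame(Y = 1:2, X1 = 1:2, X2 = 3:4, noise = 5:6, max_x1 = 7:8)

# contains() ignores case by default, so max_x1 is selected as well.
loose <- names(select(d, Y, contains("X")))
loose
#> [1] "Y"      "X1"     "X2"     "max_x1"

# An anchored, case-sensitive regex keeps only Y, X1, X2, ...
strict <- names(select(d, Y, matches("^X\\d+$", ignore.case = FALSE)))
strict
#> [1] "Y"  "X1" "X2"
```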
I want to select columns using both their names and their values in a single pipe chain, without referring to other objects such as NAMES <- names(d). Can I do it with select_if()?
For example,
I can use colnames to select cols.
(select(matches(...)) is the smarter choice when only colnames are involved.)
library(dplyr)
d <- iris %>% select(-Species) %>% tibble::as_tibble()
d %>% select_if(stringr::str_detect(names(.), "Petal"))
And I can use the values.
d %>% select_if(~ mean(.) > 5)
But how can I use both of them (especially with OR)?
The code below is what I want (of course, it doesn't run).
d %>% select_if(stringr::str_detect(names(.), "Petal") | ~ mean(.) > 5)
Any help would be greatly appreciated.
A workaround that is not too complicated is:
d %>% select_if(stringr::str_detect(names(.), "Petal") | sapply(., mean) > 5)
# or
d %>% select_if(grepl("Petal",names(.)) | sapply(., mean) > 5)
Which gives:
# A tibble: 150 x 3
Sepal.Length Petal.Length Petal.Width
<dbl> <dbl> <dbl>
1 5.1 1.4 0.2
2 4.9 1.4 0.2
3 4.7 1.3 0.2
4 4.6 1.5 0.2
5 5.0 1.4 0.2
6 5.4 1.7 0.4
7 4.6 1.4 0.3
8 5.0 1.5 0.2
9 4.4 1.4 0.2
10 4.9 1.5 0.1
# ... with 140 more rows
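In dplyr 1.0.0 and later, the same OR condition can probably be written with select() alone, combining a name helper with where(). This is a sketch of that alternative rather than part of the original answer, and the column order of the result may differ from the select_if() output.

```r
library(dplyr)

d <- iris %>% select(-Species) %>% tibble::as_tibble()

# Name condition OR value condition, combined with | inside select().
res <- d %>% select(matches("Petal") | where(~ mean(.x) > 5))

names(res)
```

This keeps everything in one pipe and avoids the explicit names(.) and sapply() calls.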