I want to create a new variable at a specific position. I can create the variable with mutate and then reorder with select, but I would prefer to do it with tibble::add_column.
Here is a simple example with the iris dataset:
library(tidyverse)
## This works fine
iris %>% mutate(With_mutate = ifelse(Sepal.Length > 4 & Sepal.Width > 3 , TRUE, FALSE)) %>%
select(Sepal.Length:Petal.Width, With_mutate, everything()) %>%
head()
## This works also
iris %>% add_column(With_add_column = "Test", .before = "Species") %>%
head()
## This doesn't work
iris %>% add_column(With_add_column = ifelse(Sepal.Length > 4 & Sepal.Width > 3 , TRUE, FALSE), .before = "Species") %>%
head()
Error in ifelse(Sepal.Length > 4 & Sepal.Width > 3, TRUE, FALSE) :
object 'Sepal.Length' not found
I would greatly appreciate if someone could tell me why my ifelse statement doesn't work with add_column.
The reason is that mutate, summarise, etc. look up bare column names inside the data frame, whereas add_column evaluates its arguments as ordinary R expressions and cannot see the columns of the existing data frame. So we extract the columns explicitly with .$
iris %>%
  add_column(With_add_column = ifelse(.$Sepal.Length > 4 &
                                      .$Sepal.Width > 3, TRUE, FALSE),
             .before = "Species") %>%
  head()
#Sepal.Length Sepal.Width Petal.Length Petal.Width With_add_column Species
#1 5.1 3.5 1.4 0.2 TRUE setosa
#2 4.9 3.0 1.4 0.2 FALSE setosa
#3 4.7 3.2 1.3 0.2 TRUE setosa
#4 4.6 3.1 1.5 0.2 TRUE setosa
#5 5.0 3.6 1.4 0.2 TRUE setosa
#6 5.4 3.9 1.7 0.4 TRUE setosa
To make it more compact: the logical condition already evaluates to TRUE/FALSE, so the ifelse is unnecessary, i.e.
add_column(With_add_column = .$Sepal.Length > 4 & .$Sepal.Width > 3, .before = "Species")
can replace the add_column step above.
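As a rough side-by-side sketch of the two evaluation styles (assuming dplyr 1.0.0 or later, where mutate gained the .before/.after arguments; the names Flag, with_add_column and with_mutate are illustrative only and not from the original post):

```r
library(dplyr)
library(tibble)

# add_column() does not look names up in the data, so columns are reached via `.`
with_add_column <- iris %>%
  add_column(Flag = .$Sepal.Length > 4 & .$Sepal.Width > 3, .before = "Species")

# mutate() resolves bare column names in the data and can position the new column
with_mutate <- iris %>%
  mutate(Flag = Sepal.Length > 4 & Sepal.Width > 3, .before = Species)

head(with_mutate)
```

Both calls should place Flag immediately before Species; the mutate version avoids the .$ prefix if moving away from add_column is acceptable.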
I have a column of numbers that I want to change from a count to a percentage.
This code works:
df <- df %>%
select(casualty_veh_ref, JourneyPurpose ) %>%
group_by(JourneyPurpose) %>%
summarise(Number=n()) %>%
mutate(Percentage=Number/sum(Number)*100)
df$Percentage <- paste(round(df$Percentage), "%", sep="")
But if I try to keep the piping using percent_format from the scales package:
df <- df %>%
select(casualty_veh_ref, JourneyPurpose ) %>%
group_by(JourneyPurpose) %>%
summarise(Number=n()) %>%
mutate(Percentage=Number/sum(Number)) %>%
percent_format(Percentage, suffix = "%")
I receive the error message
Error in force_all(accuracy, scale, prefix, suffix, big.mark, decimal.mark, :
object 'Percentage' not found
I don't understand why the object is not found
Try this (I've used iris for illustration):
library(dplyr)
iris %>%
slice(1:4) %>%
mutate(Test=Sepal.Length/45,Test=scales::percent(Test))
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test
1 5.1 3.5 1.4 0.2 setosa 11.33%
2 4.9 3.0 1.4 0.2 setosa 10.89%
3 4.7 3.2 1.3 0.2 setosa 10.44%
4 4.6 3.1 1.5 0.2 setosa 10.22%
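As for why the original call fails: percent_format is not a data-frame verb. It builds and returns a formatting function, so when the pipe hands it the data frame, Percentage is looked up as an ordinary object and is not found. A minimal sketch of two ways to keep everything in the pipe, using a made-up counts table in place of the asker's data (counts, fmt, result and result2 are hypothetical names):

```r
library(dplyr)

# Hypothetical stand-in for the summarised counts in the question
counts <- data.frame(JourneyPurpose = c("Commuting", "Leisure", "Other"),
                     Number = c(50, 30, 20))

# Option 1: format the proportion directly inside mutate()
result <- counts %>%
  mutate(Percentage = scales::percent(Number / sum(Number), accuracy = 1))

# Option 2: percent_format() returns a formatter, which is then applied to the vector
fmt <- scales::percent_format(accuracy = 1)
result2 <- counts %>%
  mutate(Percentage = fmt(Number / sum(Number)))

result
```

Here accuracy = 1 rounds to whole percentages, which mirrors the round() call in the working version of the question.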
It seems like dplyr::pull() and dplyr::select() do the same thing. Is there a difference besides that dplyr::pull() only selects 1 variable?
First, it makes sense to see what class each function creates.
library(dplyr)
mtcars %>% pull(cyl) %>% class()
#> 'numeric'
mtcars %>% select(cyl) %>% class()
#> 'data.frame'
So pull() creates a vector -- which, in this case, is numeric -- whereas select() creates a data frame.
Basically, pull() is equivalent to writing mtcars$cyl or mtcars[, "cyl"], whereas select() removes all of the columns except for cyl but maintains the data frame structure.
You could see select as an analogue of [ or magrittr::extract, and pull as an analogue of [[ (or $) or magrittr::extract2 for data frames (an analogue of [[ for lists would be purrr::pluck).
df <- iris %>% head
All of these give the same output:
df %>% pull(Sepal.Length)
df %>% pull("Sepal.Length")
a <- "Sepal.Length"; df %>% pull(!!quo(a))
df %>% extract2("Sepal.Length")
df %>% `[[`("Sepal.Length")
df[["Sepal.Length"]]
# all of them:
# [1] 5.1 4.9 4.7 4.6 5.0 5.4
And all of these give the same output:
df %>% select(Sepal.Length)
a <- "Sepal.Length"; df %>% select(!!quo(a))
df %>% select("Sepal.Length")
df %>% extract("Sepal.Length")
df %>% `[`("Sepal.Length")
df["Sepal.Length"]
# all of them:
# Sepal.Length
# 1 5.1
# 2 4.9
# 3 4.7
# 4 4.6
# 5 5.0
# 6 5.4
pull and select accept bare (unquoted) column names, character strings, or numeric positions, while the others take only character or numeric indices.
One important difference is how they handle negative indices.
For select negative indices mean columns to drop.
For pull they mean count from last column.
df %>% pull(-Sepal.Length)
df %>% pull(-1)
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
This looks odd, but Sepal.Length is converted to its position, 1, and column -1 counted from the right is Species (the last column).
This feature is not supported by [[ and extract2:
df %>% `[[`(-1)
df %>% extract2(-1)
df[[-1]]
# Error in .subset2(x, i, exact = exact) :
# attempt to select more than one element in get1index <real>
Negative indices to drop columns are supported by [ and extract though.
df %>% select(-Sepal.Length)
df %>% select(-1)
df %>% `[`(-1)
df[-1]
# Sepal.Width Petal.Length Petal.Width Species
# 1 3.5 1.4 0.2 setosa
# 2 3.0 1.4 0.2 setosa
# 3 3.2 1.3 0.2 setosa
# 4 3.1 1.5 0.2 setosa
# 5 3.6 1.4 0.2 setosa
# 6 3.9 1.7 0.4 setosa
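A compact sketch that puts the two meanings of a negative index side by side (the names last_col and drop_first are illustrative, not from the answer):

```r
library(dplyr)

df <- head(iris)

# pull(): a negative position counts from the right, so -1 is the last column
last_col <- df %>% pull(-1)

# select(): a negative position drops that column and keeps a data frame
drop_first <- df %>% select(-1)

class(last_col)
names(drop_first)
```

So the same -1 yields the Species vector with pull but a four-column data frame without Sepal.Length with select.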
I am trying to use a combination of mutate_at and which.max to manipulate a data frame as outlined below.
#This is basically what I want to achieve
df_want <- iris %>% group_by(Species) %>% mutate(Sepal.Length = Sepal.Length[which.max(Petal.Width)],
Sepal.Width = Sepal.Width[which.max(Petal.Width)])
#Here is my attempt at a smarter solution, but it does not work
df_attempt <- iris %>% group_by(Species) %>% mutate_at(c("Sepal.Length", "Sepal.Width"), function(x) x[which.max("Petal.Width")])
#However, this works
df_test <- iris %>% group_by(Species) %>% mutate_at(c("Sepal.Length", "Sepal.Width"), function(x) x + 100)
The code to produce df_attempt does not work. I get the following error message:
Error in mutate_impl(.data, dots) :
Column `Sepal.Length` must be length 50 (the group size) or one, not 0
Any ideas how I can get around this while still using mutate_at?
The standard dplyr way would be:
df_want <- iris %>%
  group_by(Species) %>%
  mutate(Sepal.Length = Sepal.Length[which.max(Petal.Width)],
         Sepal.Width = Sepal.Width[which.max(Petal.Width)])
df_attempt <- iris %>%
  group_by(Species) %>%
  mutate_at(vars(Sepal.Length, Sepal.Width), funs(.[which.max(Petal.Width)]))
Result:
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5 3.5 1.4 0.2 setosa
2 5 3.5 1.4 0.2 setosa
3 5 3.5 1.3 0.2 setosa
4 5 3.5 1.5 0.2 setosa
5 5 3.5 1.4 0.2 setosa
6 5 3.5 1.7 0.4 setosa
7 5 3.5 1.4 0.3 setosa
8 5 3.5 1.5 0.2 setosa
9 5 3.5 1.4 0.2 setosa
10 5 3.5 1.5 0.1 setosa
# ... with 140 more rows
> identical(df_want, df_attempt)
[1] TRUE
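Note:
With vars you can reference variables by their bare names (NSE).
With funs you refer to the column being transformed as ., in the same way x is used in function(x), and other columns such as Petal.Width can be used by their bare names. The original attempt failed because which.max("Petal.Width") was applied to a string rather than the column, so it returned a zero-length index.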
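In more recent dplyr releases (1.0.0 and later, which is an assumption about the installed version) funs and mutate_at have been superseded by across. A sketch of the same operation in that style, with df_across as an illustrative name:

```r
library(dplyr)

# Why the original attempt fails: which.max() on a string yields no index
length(suppressWarnings(which.max("Petal.Width")))

# across() applies the lambda to each listed column; .x is the current column,
# and Petal.Width is still resolved within the current group
df_across <- iris %>%
  group_by(Species) %>%
  mutate(across(c(Sepal.Length, Sepal.Width),
                ~ .x[which.max(Petal.Width)])) %>%
  ungroup()

head(df_across)
```

The formula lambda plays the role that funs(.[...]) plays in the answer above, with .x in place of the dot.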
I want to rename a specific column with new name which comes as a variable in dplyr.
newName = paste0('nameY', 2017)
What I tried was
iris %>%
rename(newName = Petal.Length) %>%
head(2)
Which gives
Sepal.Length Sepal.Width newName Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
I am getting newName not nameY2017 which is normal. So I tried
iris %>%
rename_(eval(newName) = 'Petal.Length')
But then I am getting an error.
Error: unexpected '=' in "iris %>% rename_(eval(newName) ="
Is there a proper way to do it with dplyr?
I know I can do something like
names(iris)[3] <- newName
But that wouldn't be dplyr solution.
Credit and further information: see the post "dplyr 'rename' standard evaluation function not working as expected?"
Your code:
newName = paste0('nameY', 2017)
iris %>%
rename(newName = Petal.Length) %>%
head(2)
Solution:
iris %>%
rename_(.dots = setNames("Petal.Length",newName)) %>%
head(2)
Output:
Sepal.Length Sepal.Width nameY2017 Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
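rename_ is deprecated in current dplyr, so here is a hedged sketch of two tidy-evaluation alternatives, assuming dplyr 1.0.0 or later. The names renamed, renamed2 and lookup are illustrative:

```r
library(dplyr)

newName <- paste0("nameY", 2017)

# Unquote the string on the left-hand side with !! and use := instead of =
renamed <- iris %>%
  rename(!!newName := Petal.Length)

# Alternative: a named character vector (new name = old name) wrapped in all_of()
lookup <- setNames("Petal.Length", newName)
renamed2 <- iris %>%
  rename(all_of(lookup))

head(renamed, 2)
```

The := operator is needed because R's parser does not allow an expression such as !!newName on the left of a plain = inside a call, which is the same restriction behind the "unexpected '='" error in the question.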
I want to do something like this
df <- iris %>%
rowwise %>%
mutate(new_var = sum(Sepal.Length, Sepal.Width))
Except I want to do it without typing the variable names, e.g.
names_to_add <- c("Sepal.Length", "Sepal.Width")
df <- iris %>%
rowwise %>%
[some function that uses names_to_add]
I attempted a few things e.g.
df <- iris %>%
rowwise %>%
mutate(new_var = sum(sapply(names_to_add, get, envir = as.environment(.))))
but still can't figure it out. I'll take an answer that plays around with lazyeval or something that's simpler. Note that the sum function here is just a placeholder and my actual function is much more complex, although it returns one value per row. I'd also rather not use data.table
You should check out the dplyr functions whose names end with _, for example mutate_, summarise_, etc.
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df <- iris %>%
rowwise %>% mutate_(names_to_add)
Edit
The results of the code:
df <- iris %>%
rowwise %>% mutate(new_var = sum(Sepal.Length, Sepal.Width))
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df2 <- iris %>%
rowwise %>% mutate_(new_var = names_to_add)
identical(df, df2)
[1] TRUE
Edit
I edited the answer and it solves the problem. I wonder why it was downvoted. We use SE (standard evaluation), passing a string as input to mutate_. More info: vignette("nse","dplyr")
x <- "Sepal.Length + Sepal.Width"
df <- mutate_(iris, x)
head(df)
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length + Sepal.Width
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
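The underscored verbs such as mutate_ have been deprecated since dplyr 0.7.0. For the original goal of passing a character vector of column names, one possible sketch with current dplyr (assuming version 1.0.0 or later for c_across) is shown below; sum remains a placeholder for the real row-wise function, as in the question:

```r
library(dplyr)

names_to_add <- c("Sepal.Length", "Sepal.Width")

# rowwise() makes each row its own group; c_across() gathers the selected
# columns of the current row into a vector for the placeholder function sum()
df <- iris %>%
  rowwise() %>%
  mutate(new_var = sum(c_across(all_of(names_to_add)))) %>%
  ungroup()

head(df)
```

all_of makes it explicit that names_to_add is an external character vector rather than a column of the data.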