May I know what is wrong with the code below? I am trying to extract the distinct values of Species from iris, but it is not working. I would like to do this without %>%.
iris[,c(distinct("Species"))]
I am guessing you want to do this:
library(dplyr)
distinct(iris, Species)
You do not need %>% to begin with. If you mean that you don't want to use the dplyr package at all, you can try what @sm925 suggested in a comment: as.character(unique(iris$Species))
This will give you a vector with all unique species:
unique(iris$Species)
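As a quick side-by-side sketch of the two suggestions above (the variable names here are only illustrative), distinct() returns a data frame while unique() returns a factor, which can be converted to a character vector:

```r
library(dplyr)

# dplyr without the pipe: the data frame is simply the first argument
species_df <- distinct(iris, Species)      # data frame with one row per species

# base R: no packages needed
species_fct <- unique(iris$Species)        # factor holding the unique values
species_chr <- as.character(species_fct)   # plain character vector

print(species_df)
print(species_chr)
```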
I would like to sort my data by a variable called "Name", ordered by the number of characters in each name. I'm aware that I need the arrange() function from the dplyr package, but I cannot find an option in arrange() that sorts by the number of characters.
So far I have come up with: arrange((Name))
Is there someone who can help me with this?
Here's a simple workaround with dplyr package and iris data:
library(dplyr)
iris %>%
  mutate(Species = as.character(Species)) %>% # Convert factor to character
  arrange(nchar(Species))
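The same idea applied to a column called Name might look like the sketch below. The people data frame is made up purely for illustration, since the question does not show its data:

```r
library(dplyr)

# Hypothetical data standing in for the asker's "Name" column
people <- data.frame(
  Name = c("Alexander", "Bo", "Chris"),
  stringsAsFactors = FALSE
)

# arrange() accepts expressions, so nchar(Name) can be used directly
sorted <- arrange(people, nchar(Name))
print(sorted)
```

If Name were a factor, it would need as.character() first, as in the iris example.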
I'm trying, as per "dplyr mutate using variable columns" and "dplyr - mutate: use dynamic variable names", to use dynamic names in mutate. What I am trying to do is normalize column data by group, subject to a minimum standard deviation. Each column has a different minimum standard deviation.
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>% group_by(Species) %>% mutate(!!varname := mean(pluck(iris,varname),na.rm=T)/max(sd(pluck(iris,varname)),minsd[varname]))
I got the dynamic assignment and variable selection to work as suggested by the referenced answers, but group_by() is not respected, which, for me at least, is the main benefit of using dplyr here.
The desired answer is given by:
iris %>% group_by(Species) %>% mutate(!!varname := mean(Sepal.Length,na.rm=T)/max(sd(Sepal.Length),minsd[varname]))
Is there a way around this?
I don't know much about pluck, so I can't say what went wrong, but I would go with this, which works:
iris %>%
  group_by(Species) %>%
  mutate(
    !!varname :=
      mean(!!as.name(varname), na.rm = TRUE) /
      max(sd(!!as.name(varname)), minsd[varname])
  )
Let me know if this isn't what you were looking for.
The other answer is clearly the best, and it also solved a similar problem that I had encountered. For example, with !!as.name(), there is no need to use group_by_() (or group_by_at()) or arrange_() (or arrange_at()).
However, another way is to replace pluck(iris, varname) in your code with .data[[varname]]. The reason pluck(iris, varname) does not work is, I suppose, that iris inside pluck(iris, varname) is the original, ungrouped data frame. In contrast, .data refers to the data that mutate() is currently operating on, and so it respects the grouping.
An alternative to as.name() is rlang::sym() from the rlang package.
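A minimal sketch of both alternatives, reusing varname and minsd from the question. The result names res_data and res_sym are only illustrative, and minsd[[varname]] is used instead of minsd[varname] just to drop the element name:

```r
library(dplyr)

minsd <- c(Sepal.Length = 0.8)
varname <- "Sepal.Length"

# Option 1: the .data pronoun, which is evaluated per group
res_data <- iris %>%
  group_by(Species) %>%
  mutate(
    !!varname := mean(.data[[varname]], na.rm = TRUE) /
      max(sd(.data[[varname]]), minsd[[varname]])
  )

# Option 2: turn the string into a symbol with rlang::sym() and unquote it
res_sym <- iris %>%
  group_by(Species) %>%
  mutate(
    !!varname := mean(!!rlang::sym(varname), na.rm = TRUE) /
      max(sd(!!rlang::sym(varname)), minsd[[varname]])
  )

# Both should give one normalized value per species
print(distinct(res_data, Species, Sepal.Length))
```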
I am trying to do something very simple, and yet I can't figure out the right way to specify it. I simply want to exclude some named columns from mutate_at. It works fine if I specify positions, but I don't want to hard-code positions.
For example, I want the same output as this:
mtcars %>% mutate_at(-c(1, 2), max)
But by specifying the mpg and cyl column names.
I tried many things, including:
mtcars %>% mutate_at(-c('mpg', 'cyl'), max)
Is there a way to work with names and exclusion in mutate_at?
You can use vars to specify the columns, which works the same way as select() and allows you to exclude columns using -:
mtcars %>% mutate_at(vars(-mpg, -cyl), max)
One option is to pass the strings inside one_of():
mtcars %>%
  mutate_at(vars(-one_of("mpg", "cyl")), max)
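As a sketch, the excluded names can also be kept in a character vector (excluded is a name chosen here for illustration). The second form assumes dplyr 1.0.0 or later, where across() and all_of() are available; it is not part of the original answers:

```r
library(dplyr)

excluded <- c("mpg", "cyl")

# As in the answer above, with the names supplied as a character vector
by_vars <- mtcars %>%
  mutate_at(vars(-one_of(excluded)), max)

# Assumed equivalent for dplyr >= 1.0.0, using across() with all_of()
by_across <- mtcars %>%
  mutate(across(-all_of(excluded), max))

head(by_vars, 3)
```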
library(tidyverse)
library(ggmosaic) # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
  select_if(is.factor) %>%
  map(~round(prop.table(table(.x)), 2))
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
  select_if(is.factor) %>%
  gather() %>%
  filter(!is.na(value)) %>%
  group_by(key, value) %>%
  summarise(count = n()) %>%
  mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
  split(.$key) %>%
  map(~spread(.x, value, perc))
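To try the formula approach from the first option without installing ggmosaic, here is a small sketch on the built-in warpbreaks dataset. That dataset is substituted here for illustration only; it has two factor columns, wool and tension:

```r
library(dplyr)
library(purrr)

# One table of rounded proportions per factor column
tabs <- warpbreaks %>%
  select_if(is.factor) %>%
  map(~ round(prop.table(table(.x)), 2))

print(tabs$wool)
print(tabs$tension)
```

The result is a named list, so each table can be pulled out by column name.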
I noticed that the order in which dplyr functions are used in a pipeline affects the result. For example:
iris %>%
  group_by(Species) %>%
  mutate(Sum = sum(Sepal.Length))
produces different results than this:
iris %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  group_by(Species)
Can anyone explain the reason for this? If there is a specific order in which they have to be written, please mention it.
Thank you
FYI: iris is a built-in dataset in R; use data(iris) to load it. I was trying to add a new column containing the sum of sepal lengths for each species.
Yes, the order matters.
The first pipe is equivalent to:
iris <- group_by(iris, Species)
iris <- mutate(iris, Sum = sum(Sepal.Length))
If you change the order, you change the result. If you group by species first, you get the sum per species (I guess that's what you want).
However, if you group by species after the sum, the sum is computed over the Sepal.Length of all species.
Yes, the order matters because each step of the pipe is evaluated on its own, from first to last, and the result of each step (or the original dataset, for the first step) is passed on to the next step. That means that if you use group_by after mutate, as in your example, the mutate is done without grouping.
One side effect is that you can create complex and long pipes where you control the order of operations (by positioning them at the right part of the pipe) and you don't need to start a new pipe after an operation is finished.
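A small sketch that makes the difference visible by counting the distinct values of Sum in each version (the result names are illustrative):

```r
library(dplyr)

# Grouping first: sum() is evaluated once per species
grouped_first <- iris %>%
  group_by(Species) %>%
  mutate(Sum = sum(Sepal.Length))

# Grouping last: sum() has already been evaluated over the whole column
grouped_last <- iris %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  group_by(Species)

n_distinct(grouped_first$Sum)  # 3, one sum per species
n_distinct(grouped_last$Sum)   # 1, the overall sum repeated on every row
```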