Error using dplyr package in R - r

I am using the below code to extract the summary of data with respect to column x by counting the values in column x from the dataset unique_data and arranging the count values in descending order.
unique_data %>%
group_by(x) %>%
arrange(desc(count(x)))
But when I execute the above code, I get the following error message:
Error: no applicable method for 'group_by_' applied to an object of class "character"
Please let me know what is going wrong in my code. For your information, column x is of character data type.
Regards,
Mohan

The problem is that count() is wrapped inside arrange(). There, count(x) receives the character vector x instead of a data frame, which is why group_by_ fails on an object of class "character". The two steps need to be done separately: keeping the rest of the OP's code, split count and arrange into two separate pipe steps. The output of count is a frequency column 'n' (by default), which we then arrange in descending (desc) order.
unique_data %>%
group_by(x) %>%
count(x) %>%
arrange(desc(n))
Also, the group_by is not needed. According to the ?count documentation:
tally is a convenient wrapper for summarise that will either call n or
sum(n) depending on whether you're tallying for the first time, or
re-tallying. count() is similar, but also does the group_by for you.
So based on that, we can just do
count(unique_data, x) %>%
arrange(desc(n))
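As a minimal sketch of the same idea on made-up data (the toy unique_data below is an assumption, not the OP's dataset), count() also accepts sort = TRUE, which orders the result by n in descending order so the separate arrange step can be dropped:

```r
library(dplyr)

# Hypothetical stand-in for the OP's unique_data: a single character column x
unique_data <- data.frame(
  x = c("a", "b", "a", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# count() groups by x and tallies the rows;
# sort = TRUE orders the result by the frequency column n, largest first
result <- count(unique_data, x, sort = TRUE)
result
```

This should give the same rows as count(unique_data, x) %>% arrange(desc(n)).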

Related

How to interpret column length error from dplyr::mutate?

I'm trying to apply a function (more complex than the one used below, but I was trying to simplify) across two vectors. However, I receive the following error:
mutate_impl(.data, dots) :
Column `diff` must be length 2 (the group size) or one, not 777
I think I may be getting this error because the difference between rows results in one row less than the original dataframe, per some posts that I read. However, when I followed that advice and tried to add a vector to add 0/NA on the final line I received another error. Did I at least identify the source of the error correctly? Ideas? Thank you.
Original code:
diff_df <- DF %>%
group_by(DF$var1, DF$var2) %>%
mutate(diff = map2(DF$duration, lead(DF$duration), `-`)) %>%
as.data.frame()
We don't need map2 to get the difference between 'duration' and the lead of 'duration', because subtraction is vectorized. map2 would loop through each element of 'duration' together with the corresponding element of lead(duration), which is unnecessary.
DF %>%
group_by(var1, var2) %>%
mutate(diff = duration - lead(duration))
NOTE: Extracting the column with DF$duration after the group_by breaks the grouping and returns the full column of the original dataset, which is consistent with the error reporting a length of 777 instead of the group size. Also, inside the pipe there is no need for dataset$columnname; use the bare columnname. (However, in certain situations, when we want the full column for some comparison, it can be used.)
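Here is a small self-contained sketch of the grouped difference. The data frame below is invented for illustration; only the column names var1, var2 and duration come from the question:

```r
library(dplyr)

# Hypothetical data: two groups, defined by var1 and var2
DF <- data.frame(
  var1     = c("a", "a", "a", "b", "b"),
  var2     = c(1, 1, 1, 2, 2),
  duration = c(10, 7, 3, 8, 5)
)

diff_df <- DF %>%
  group_by(var1, var2) %>%
  # bare column names are evaluated within each group,
  # so lead() never reaches into the next group
  mutate(diff = duration - lead(duration)) %>%
  ungroup() %>%
  as.data.frame()

diff_df
```

The last row of each group has no following row, so its diff is NA rather than a value borrowed from the next group.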

Trying to understand dplyr function - group_by

I am trying to understand how the group_by function works in dplyr. I am using the airquality data set that comes with the datasets package.
I understand that if I do the following, it should arrange the records in increasing order of the Temp variable:
airquality_max1 <- airquality %>% arrange(Temp)
I see that is the case in airquality_max1. I now want to arrange the records by increasing order of Temp but grouped by Month. So the end result should first have all the records for Month == 5 in increasing order of Temp. Then it should have all records of Month == 6 in increasing order of Temp and so on, so I use the following command
airquality_max2 <- airquality %>% group_by(Month) %>% arrange(Temp)
However, what I find is that the results are still in increasing order of Temp only, not grouped by Month, i.e., airquality_max1 and airquality_max2 are equal.
I am not sure why the grouping by Month does not happen before the arrange function. Can anyone help me understand what I am doing wrong here?
More than the problem of trying to sort the data frame by columns, I am trying to understand the behavior of group_by as I am trying to use this to explain the application of group_by to someone.
arrange() ignores grouping by default; see the breaking changes in dplyr 0.5.0. If you need to order by two columns, you can do:
airquality %>% arrange(Month, Temp)
For a grouped data frame, you can also use the .by_group argument to sort by the grouping variable first.
airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
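To see the difference side by side, here is a short sketch comparing the three variants on airquality (the object names are made up for the comparison, and a dplyr version that supports .by_group is assumed):

```r
library(dplyr)

# Grouped, but arrange() sorts by Temp across the whole table
plain <- airquality %>% group_by(Month) %>% arrange(Temp)

# Grouped, and arrange() sorts by Month first, then Temp
by_group <- airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)

# Ungrouped equivalent: sort by both columns explicitly
two_cols <- airquality %>% arrange(Month, Temp)

is.unsorted(plain$Month)     # Month is interleaved here
is.unsorted(by_group$Month)  # Month is in order here
```

The by_group and two_cols results should contain the rows in the same Month/Temp order; the only difference is that by_group is still a grouped data frame.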

How to Create Multiple Frequency Tables with Percentages Across Factor Variables using Purrr::map

library(tidyverse)
library(ggmosaic) # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
select_if(is.factor) %>%
map(~round(prop.table(table(.x)), 2))
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
select_if(is.factor) %>%
gather() %>%
filter(!is.na(value)) %>%
group_by(key, value) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
split(.$key) %>%
map(~spread(.x, value, perc))
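For anyone without ggmosaic installed, the first approach can be tried on a toy data frame. The data below is invented and only mimics the shape of "happy" (a couple of factor columns plus a numeric one):

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the "happy" dataset
toy <- data.frame(
  marital = factor(c("married", "single", "married", "married")),
  sex     = factor(c("f", "m", "f", "m")),
  age     = c(30, 41, 25, 58)
)

pct_tables <- toy %>%
  select_if(is.factor) %>%
  # the formula makes an anonymous function; .x is one factor column at a time
  map(~ round(prop.table(table(.x)), 2))

pct_tables$marital
```

The original attempt failed because map() needs a function (or formula) to apply to each column; round(prop.table(table)) is evaluated immediately, with the function table itself as its input.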

Does the order in which dplyr functions are used in a pipeline matter?

I noticed that the order in which dplyr functions are used in a pipeline affects the result. For example:
iris %>%
group_by(Species) %>%
mutate(Sum = sum(Sepal.Length))
produces different results than this:
iris %>%
mutate(Sum = sum(Sepal.Length)) %>%
group_by(Species)
Can anyone explain the reason for this? If there is a specific order in which they have to be called, please mention it.
Thank you
FYI: iris is a built-in dataset in R; use data(iris) to load it. I was trying to add a new column with the sum of sepal lengths for each species.
Yes, the order matters.
The pipe is equivalent to:
iris<-group_by(iris, Species)
iris<-mutate(iris, Sum = sum(Sepal.Length))
If you change the order, you change the result. If you group by species first, you'll have the result of the sum by species (I guess that's what you want).
However if you group by species after the sum, this sum will correspond to summing the Sepal length for all species.
Yes, the order matters because each part of the pipe is evaluated on its own, starting from the first through to the last pipe-part and the result of the previous pipe (or original dataset) is piped forward to the next following pipe-part. That means, if you use group_by after the mutate as in your example, the mutate will be done without grouping.
One side effect is that you can create complex and long pipes where you control the order of operations (by positioning them at the right part of the pipe) and you don't need to start a new pipe after an operation is finished.
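A quick way to check this is to count how many distinct values end up in Sum for each ordering. This is just a sketch; the object names are made up:

```r
library(dplyr)

# group_by() first: sum() is computed within each Species
grouped_first <- iris %>%
  group_by(Species) %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  ungroup()

# mutate() first: sum() is computed over all rows, grouping happens afterwards
mutated_first <- iris %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  group_by(Species)

length(unique(grouped_first$Sum))  # one value per species
length(unique(mutated_first$Sum))  # a single overall value
```

In the second pipeline the group_by() only attaches grouping to the result; it has no effect on the mutate() that already ran.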

Error dplyr summarise

I have a data.frame:
set.seed(1L)
vector <- data.frame(patient=rep(1:5,each=2),medicine=rep(1:3,length.out=10),prob=runif(10))
I want to get the mean of the "prob" column while grouping by patient. I do this with the following code:
vector %>%
group_by(patient) %>%
summarise(average=mean(prob))
This code works perfectly. However, I need to get the same values without using the word "prob" on the "summarise" line. I tried the following code, but it gives me a data.frame in which the column "average" contains 5 identical values, which is not what I want:
vector %>%
group_by(patient) %>%
summarise(average=mean(vector[,3]))
P.S.: To explain why I need this: I have another data frame with multiple columns with complex names that need to be "summarised", so I can't list them one by one in the summarise command. What I want is to pass a vector there to calculate the mean of each column grouped by patient.
It appears you want summarise_each:
vector %>%
group_by(patient) %>%
summarise_each(funs(mean), matches('prob'))
Using data.table you could do
library(data.table)
setDT(vector)[, lapply(.SD, mean), by = patient, .SDcols = 'prob']
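The attempt with mean(vector[,3]) returns identical values because vector[,3] pulls the whole column from the original data frame and ignores the grouping. As a sketch for newer dplyr versions (assuming dplyr 1.0.0 or later, which postdates the answer above), across() with all_of() lets you pass the column names as a character vector; the cols vector here is a made-up placeholder for your own list of names:

```r
library(dplyr)

set.seed(1L)
vector <- data.frame(
  patient  = rep(1:5, each = 2),
  medicine = rep(1:3, length.out = 10),
  prob     = runif(10)
)

# Hypothetical character vector holding the names of the columns to summarise
cols <- c("prob")

averages <- vector %>%
  group_by(patient) %>%
  # across() applies mean() to every column named in cols, within each patient
  summarise(across(all_of(cols), mean))

averages
```

With more columns, only the cols vector needs to change; the summarise line stays the same.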
