I am trying to understand how the group_by function works in dplyr. I am using the airquality data set that comes with the datasets package.
I understand that if I do the following, it should arrange the records in increasing order of the Temp variable:
airquality_max1 <- airquality %>% arrange(Temp)
I see that is the case in airquality_max1. I now want to arrange the records by increasing order of Temp but grouped by Month. So the end result should first have all the records for Month == 5 in increasing order of Temp. Then it should have all records of Month == 6 in increasing order of Temp and so on, so I use the following command
airquality_max2 <- airquality %>% group_by(Month) %>% arrange(Temp)
However, what I find is that the results are still in increasing order of Temp only, not grouped by Month, i.e., airquality_max1 and airquality_max2 are equal.
I am not sure why the grouping by Month does not happen before the arrange function. Can anyone help me understand what I am doing wrong here?
More than the problem of trying to sort the data frame by columns, I am trying to understand the behavior of group_by as I am trying to use this to explain the application of group_by to someone.
arrange ignores group_by by default; see the breaking changes in dplyr 0.5.0. If you need to order by two columns, you can do:
airquality %>% arrange(Month, Temp)
For a grouped data frame, you can also use the .by_group argument to sort by the grouping variable first.
airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
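The difference between the variants above can be checked directly. The following is a minimal sketch on the built-in airquality data; the object names (plain, grouped, by_group, two_keys) are only illustrative.

```r
library(dplyr)

# Sort by Temp only, with and without grouping
plain   <- airquality %>% arrange(Temp)
grouped <- airquality %>% group_by(Month) %>% arrange(Temp)

# Sort within each Month, two equivalent ways
by_group <- airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
two_keys <- airquality %>% arrange(Month, Temp)

# arrange() ignored the grouping: the row order matches the ungrouped sort
identical(grouped$Temp, plain$Temp)

# With .by_group = TRUE the rows come out in the same order as arrange(Month, Temp)
identical(by_group$Month, two_keys$Month) && identical(by_group$Temp, two_keys$Temp)
```

The grouping is still attached to grouped and by_group, so later verbs such as summarise() or mutate() will operate per Month; only the row ordering step treats groups as optional.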
I have a dataset with 51 columns and I want to add summary rows for most of these variables. Currently columns 5:48 are various metrics with each row being 1 area from 1 quarter, I am summing the metric for all quarters for each area and ultimately creating a rate. The below code works fine for doing this to one individual metric but I need to run this for 44 different columns.
example <- test %>%
group_by(Area) %>%
summarise(`Metric 1`= (sum(`Metric 1`))/(mean(Population))*10000) %>%
bind_rows(test) %>%
arrange(Area,-Quarter)%>% # sort so that total rows come after the quarters
mutate(Timeframe=if_else(is.na(Quarter),'12 month rolling', 'Quarterly'))
I have tried creating a for loop and using the column index values; however, that hasn't worked and just returns various errors. I've also been unable to get the above script working with index values. The code below gives an error (Error: unexpected '=' in: " group_by_at(Local_Authority) %>% summarise(u17_12mo[5]="):
example <- test %>%
group_by_at(Area) %>%
summarise(test[5]= (sum(test[5]))/(mean(test[4]))*10000) %>%
bind_rows(test) %>%
arrange(Area,-Quarter)%>% # sort so that total rows come after the quarters
mutate(Timeframe=if_else(is.na(Quarter),'12 month rolling', 'Quarterly'))
Any help on setting up a for loop for this, or another way entirely, would be great.
Without data, it's tough to help, but maybe this would work for you:
library(tidyverse)
example <- test %>%
group_by(Area) %>%
summarise(across(5:48, ~(sum(.))/(mean(Population))*10000))
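Since the real data is not available, here is a small sketch of the same idea on made-up data. The toy test data frame and its values are assumptions for illustration, and the columns are selected by name with starts_with("Metric") rather than by position, which avoids depending on how column indices are counted once a grouping column is present.

```r
library(dplyr)

# Hypothetical stand-in for the real `test` data: two areas, two quarters each
test <- tibble(
  Area       = c("A", "A", "B", "B"),
  Quarter    = c(1, 2, 1, 2),
  Population = c(1000, 1000, 2000, 2000),
  `Metric 1` = c(1, 2, 3, 4),
  `Metric 2` = c(10, 20, 30, 40)
)

# One summary row per Area, with the rate computed for every metric column
totals <- test %>%
  group_by(Area) %>%
  summarise(across(starts_with("Metric"), ~ sum(.x) / mean(Population) * 10000))

# Append the original quarterly rows and label each row, as in the question
example <- totals %>%
  bind_rows(test) %>%
  arrange(Area, desc(Quarter)) %>%   # NA quarters (the totals) sort last
  mutate(Timeframe = if_else(is.na(Quarter), "12 month rolling", "Quarterly"))
```

With the real data, the selection helper would need to match the 44 metric columns, for example with a name range such as first_metric:last_metric (hypothetical names).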
I am trying to make country-level (by year) summaries of a long-form aggregated dataset that has individual-level data. I have tried using dplyr to summarize the average of the variable I am interested in to create a new dataset. However... there appears to be something wrong with my group_by because the answer is only one observation that appears to be the mean of every observation.
data named: "finaldata.giniE",
country variable: "iso3c",
year variable: "date",
individual-level variable of interest: "Ladder.Life.Present"
Note: there are more variables in my data-- could this be an issue?
country_summmary <- finaldata.giniE %>%
select(iso3c, date, Ladder.Life.Present) %>%
group_by(iso3c, date) %>%
summarize(averaged.M = mean(Ladder.Life.Present))
country_summmary
My output appears like this:
> country_summmary
averaged.M
1 5.505455
Thank you!
I actually just changed something and added your suggested code to the front and it worked! Here is the code that was able to work!
library(dplyr)
country_summary <- finaldata.gini %>%
group_by(iso3c, date) %>%
select(Ladder.Life.Present) %>%
summarise_each(funs(mean))
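A single-row result like the one in the question is often a sign that another package's summarize() (for example plyr's, if it was loaded after dplyr) is masking dplyr's version; this is a guess, since the question does not show which packages were loaded. A hedged sketch that sidesteps the issue by calling the dplyr functions explicitly, using made-up data in place of finaldata.giniE:

```r
library(dplyr)

# Hypothetical stand-in for finaldata.giniE, with one extra unrelated column
finaldata_toy <- data.frame(
  iso3c = c("USA", "USA", "FRA", "FRA"),
  date  = c(2010, 2010, 2010, 2011),
  Ladder.Life.Present = c(7, 8, 6, 5),
  other = 1:4
)

# Namespacing the calls makes sure dplyr's group-aware verbs are the ones used
country_summary <- finaldata_toy %>%
  dplyr::group_by(iso3c, date) %>%
  dplyr::summarise(averaged.M = mean(Ladder.Life.Present, na.rm = TRUE),
                   .groups = "drop")

country_summary
```

Extra columns in the data are not a problem here: summarise() only keeps the grouping columns and the summaries that are requested.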
library(tidyverse)
library(ggmosaic) # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
select_if(is.factor) %>%
map(~round(prop.table(table(.x)), 2))
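For comparison, the anonymous function form mentioned above looks like this. The sketch uses a small made-up data frame (toy) instead of the happy dataset so that it runs without ggmosaic, and select(where(is.factor)) in place of select_if().

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: two factor columns and one numeric column
toy <- data.frame(
  marital = factor(c("married", "married", "divorced", "never married")),
  sex     = factor(c("male", "female", "female", "female")),
  age     = c(30, 40, 50, 60)
)

# One rounded proportion table per factor column, returned as a named list
freq_tables <- toy %>%
  select(where(is.factor)) %>%
  map(function(x) round(prop.table(table(x)), 2))

freq_tables$sex
```

The original attempt failed because map() needs a function (or formula) to call on each column, whereas round(prop.table(table)) is evaluated immediately on the table function itself.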
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
select_if(is.factor) %>%
gather() %>%
filter(!is.na(value)) %>%
group_by(key, value) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
split(.$key) %>%
map(~spread(.x, value, perc))
I am using the below code to extract the summary of data with respect to column x by counting the values in column x from the dataset unique_data and arranging the count values in descending order.
unique_data %>%
group_by(x) %>%
arrange(desc(count(x)))
But when I execute the above code, I am getting the error message below:
Error: no applicable method for 'group_by_' applied to an object of class "character"
Kindly let me know what is going wrong in my code. For your information, column x is of character data type.
Regards,
Mohan
The error comes from wrapping count inside arrange. The two steps need to be done separately. Keeping the rest of the OP's code, we just split count and arrange into two separate steps of the pipe. The output of count is a frequency column 'n' (by default), which we then arrange in descending (desc) order.
unique_data %>%
group_by(x) %>%
count(x) %>%
arrange(desc(n))
Also, the group_by is not needed. According to the ?count documentation:
tally is a convenient wrapper for summarise that will either call n or
sum(n) depending on whether you're tallying for the first time, or
re-tallying. count() is similar, but also does the group_by for you.
So based on that, we can just do
count(unique_data, x) %>%
arrange(desc(n))
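As a further shortcut, count() has a sort argument that orders the result by descending count, so the separate arrange() step can be dropped. A small sketch with made-up data standing in for unique_data:

```r
library(dplyr)

# Hypothetical stand-in for unique_data with a character column x
unique_data <- data.frame(x = c("a", "b", "b", "c", "c", "c"),
                          stringsAsFactors = FALSE)

# Count the values of x and sort by the count, largest first
counts <- count(unique_data, x, sort = TRUE)
counts
```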
This is my first stackoverflow question.
I'm trying to use dplyr to process and output a summary of data grouped by a categorical variable (inj_length_cat3) in my dataset. Actually, I generate this variable (from inj_length) on the fly using mutate(). I also want to output the same summary of the data without grouping. The only way I figured out how to do that is to do the analysis twice over, once with, once without grouping, and then combine the outputs. Ugh.
I'm sure there is a more elegant solution than this and it bugs me. I wonder if anyone would be able to help.
Thanks!
library(dplyr)
df<-data.frame(year=sample(c(2005,2006),20,replace=T),inj_length=sample(1:10,20,replace=T),hiv_status=sample(0:1,20,replace=T))
tmp <- df %>%
mutate(inj_length_cat3 = cut(inj_length, breaks=c(0,3,100), labels = c('<3 years','>3 years')))%>%
group_by(year,inj_length_cat3)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
) %>%
filter(inj_length_cat3%in%c('<3 years','>3 years'))
tmp_all <- df %>%
group_by(year)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
)
tmp_all$inj_length_cat3=as.factor('All')
tmp<-merge(tmp_all,tmp,all=T)
I'm not sure you consider this more elegant, but you can get a solution to work if you first create a dataframe that has all your data twice: once so that you can get the subgroups and once to get the overall summary:
df1 <- rbind(df,df)
df1$inj_length_cat3 <- cut(df$inj_length, breaks=c(0,3,100,Inf),
labels = c('<3 years','>3 years','All'))
df1$inj_length_cat3[-(1:nrow(df))] <- "All"
Now you just need to run your first analysis without mutate():
tmp <- df1 %>%
group_by(year,inj_length_cat3)%>%
summarise(
r=sum(hiv_status,na.rm=T),
n=length(hiv_status),
p=prop.test(r,n)$estimate,
cilow=prop.test(r,n)$conf.int[1],
cihigh=prop.test(r,n)$conf.int[2]
) %>%
filter(inj_length_cat3%in%c('<3 years','>3 years','All'))
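Another option, offered only as a sketch, is to keep the two-pass approach from the question but move the repeated summarise() call into a helper function, so the statistics are written once. The helper name summarise_hiv is hypothetical; the data is simulated as in the question, with a seed added for reproducibility. prop.test() may warn about small samples on data this size.

```r
library(dplyr)

set.seed(1)
df <- data.frame(year = sample(c(2005, 2006), 20, replace = TRUE),
                 inj_length = sample(1:10, 20, replace = TRUE),
                 hiv_status = sample(0:1, 20, replace = TRUE))

# Hypothetical helper: the same summary for whatever grouping is passed in
summarise_hiv <- function(grouped) {
  grouped %>%
    summarise(r = sum(hiv_status, na.rm = TRUE),
              n = length(hiv_status),
              p = prop.test(r, n)$estimate,
              cilow = prop.test(r, n)$conf.int[1],
              cihigh = prop.test(r, n)$conf.int[2],
              .groups = "drop")
}

# Summary by year and category (category stored as character for easy binding)
by_cat <- df %>%
  mutate(inj_length_cat3 = as.character(
    cut(inj_length, breaks = c(0, 3, 100), labels = c("<3 years", ">3 years")))) %>%
  group_by(year, inj_length_cat3) %>%
  summarise_hiv()

# Summary by year only, labelled "All"
overall <- df %>%
  group_by(year) %>%
  summarise_hiv() %>%
  mutate(inj_length_cat3 = "All")

tmp <- bind_rows(by_cat, overall) %>% arrange(year, inj_length_cat3)
```

This still makes two passes over the data, but it avoids duplicating the rows and keeps the summary definition in one place.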