I have a data set of prices in which the first column is year and the rest of the columns are regions. I'm trying to count the number of negative values by year by region.
I've tried to use dplyr to group_by(year) then summarise_at() but I can't figure out the exact code to use.
Neg_Count <- select (BaseCase, Year, 'Hub1': 'Hub15')
Neg_Count <- Neg_Count %>%
group_by(Year) %>%
What is the best way to do this?
We can use summarise_at(). Comparing each column with 0 (. < 0) gives a logical vector, and sum() counts its TRUE values.
library(dplyr)
Neg_Count %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Hub")), ~ sum(. < 0))
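In dplyr 1.0.0 and later, summarise_at() is superseded by across(). A minimal sketch of the same count, using a small made-up data frame (prices and its values are assumptions standing in for BaseCase):

```r
library(dplyr)

# Toy data standing in for BaseCase; the values are invented for illustration.
prices <- data.frame(
  Year = c(2020, 2020, 2021, 2021),
  Hub1 = c(-1.5, 2.0, 3.0, -0.5),
  Hub2 = c(4.0, -2.0, -1.0, -3.0)
)

neg_by_year <- prices %>%
  group_by(Year) %>%
  # .x < 0 is a logical vector; sum() counts the TRUE values.
  # na.rm = TRUE keeps a missing price from turning the count into NA.
  summarise(across(starts_with("Hub"), ~ sum(.x < 0, na.rm = TRUE)))

neg_by_year
```

The result has one row per Year and one count column per Hub column.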
I have a large time series dataset, and would like to choose the top 10 observations from each date based on the values in one of my columns.
I am able to do this using group_by(Date) %>% top_n(10)
However, if the values for the 10th and 11th observation are equal, then they are both picked, so that I get 11 observations instead of 10.
Does anyone know what I can do to make sure that only 10 observations are chosen?
You can arrange the data and select the first 10 rows in each group.
library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)
Similarly, with filter
df %>%
arrange(Date, desc(col_name)) %>%
group_by(Date) %>%
filter(row_number() <= 10)
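Another option, assuming dplyr 1.0.0 or later, is slice_max() with with_ties = FALSE, which drops the extra tied rows. The sketch below uses made-up data and n = 2 instead of 10 to keep it short; the column name value is an assumption.

```r
library(dplyr)

# Toy data: on the first date the 2nd and 3rd largest values are tied.
df <- data.frame(
  Date  = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                    "2024-01-02", "2024-01-02")),
  value = c(5, 3, 3, 7, 1)
)

top2 <- df %>%
  group_by(Date) %>%
  # with_ties = FALSE returns exactly n rows per group (or fewer if the
  # group is smaller), even when values are tied at the cutoff.
  slice_max(value, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

Which of the tied rows is kept depends on row order, so arrange the data first if that matters.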
With data.table you can do
library(data.table)
setDT(df)
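df[order(Date, -value)][, .SD[1:10], by = Date]
Here -value sorts in descending order (desc() comes from dplyr, not data.table). Change value to the name of the column used for ranking; the row order also decides which observation is kept in case of ties. Note that .SD[1:10] pads groups with fewer than 10 rows with NA rows, which head() avoids:
df[order(Date, -value)][, head(.SD, 10), by = Date]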
We can use base R
df1 <- df[with(df, order(Date, -value)),]
# ave() with FUN = seq_along gives the row number within each Date group
df1[with(df1, ave(seq_along(Date), Date, FUN = seq_along) <= 10), ]
I have the following code, where I don't pipe through the summarise
library(tidyverse)
library(nycflights13)
depArrDelay <- flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
group_by(dep_delay, arr_delay)
Now doing
cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9148028 which is the correct value for my calculation
Now I add the %>% summarise (...) as seen below
depArrDelay <- flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
group_by(dep_delay, arr_delay) %>% summarise(count=n())
Now doing: cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9260394
So now the correlation is altered. Why is this happening? From what I know, summarise should only throw away all other columns that are not mentioned, and not alter values. Have I missed something, and how can I avoid summarise altering the correlation?
As already mentioned in the comments, summarise reduces the number of rows. If you need the count without changing number of rows, you can use add_count.
library(nycflights13)
library(dplyr)
temp <- flights %>%
filter_at(vars(c(dep_delay, arr_delay, distance)), all_vars(!is.na(.))) %>%
add_count(dep_delay, arr_delay)
If you then check for correlation you get the same value as earlier.
cor(temp$dep_delay, temp$arr_delay)
#[1] 0.9148027589
If the data has many columns and you need only a few of them for your analysis, you can keep the relevant ones with select.
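To see why collapsing rows changes the correlation, here is a small base R sketch on made-up vectors (x, y and pairs are illustrative names, not part of the flights data). After collapsing, each distinct pair counts once regardless of how often it occurred; weighting by the count recovers the original value.

```r
# Toy data: the pair (1, 1) occurs three times, the other pairs once each.
x <- c(1, 1, 1, 2, 3)
y <- c(1, 1, 1, 3, 2)

full_cor <- cor(x, y)

# Roughly what group_by(x, y) %>% summarise(count = n()) leaves behind:
# one row per distinct pair plus its frequency.
pairs <- aggregate(count ~ x + y, data = data.frame(x, y, count = 1), FUN = sum)

# Every distinct pair now has equal weight, so the value changes.
collapsed_cor <- cor(pairs$x, pairs$y)

# Weighting each distinct pair by its frequency recovers the original value.
w <- pairs$count / sum(pairs$count)
weighted_cor <- cov.wt(pairs[, c("x", "y")], wt = w, cor = TRUE)$cor[1, 2]

c(full = full_cor, collapsed = collapsed_cor, weighted = weighted_cor)
```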
I am trying to find the country with the highest average age, but I also need to filter out countries with fewer than 5 entries in the data frame. I tried the following but it does not work:
bil %>%
group_by(citizenship,age) %>%
mutate(n=count(citizenship), theMean=mean(age,na.rm=T)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
bil is the dataset. I am trying to count how many entries I have for each country, filter out countries with fewer than 5 entries, find the average age for each country, and then find the country with the highest average. I am confused about how to do both things at the same time. If I do one summarize at a time I lose the rest of my data.
Perhaps this could help. Note that the first argument of count() is a tbl/data.frame, not a column. So instead of count(), we group by 'citizenship' only, get the group size with n(), compute the mean of 'age', and then filter. Grouping by 'age' as well would make every group's mean equal to that age, so it is dropped from group_by().
bil %>%
group_by(citizenship) %>%
mutate(n = n()) %>%
mutate(theMean = mean(age, na.rm=TRUE)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
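If one row per country is enough, summarise() can compute the count and the mean in a single call, which avoids the "one summarize at a time" problem. A minimal sketch on made-up data (this bil is a hypothetical stand-in for the real data set):

```r
library(dplyr)

# Hypothetical stand-in for bil: B has only 2 entries and should be dropped.
bil <- data.frame(
  citizenship = c(rep("A", 5), rep("B", 2), rep("C", 6)),
  age = c(60, 62, 64, 66, 68, 90, 92, 50, 52, 54, 56, 58, NA)
)

country_means <- bil %>%
  group_by(citizenship) %>%
  # Both statistics are computed per country in one summarise() call.
  summarise(n = n(), theMean = mean(age, na.rm = TRUE)) %>%
  filter(n >= 5) %>%
  arrange(desc(theMean))

# The first row is the country with the highest average age.
top_country <- as.character(country_means$citizenship[1])
top_country
```

Note that n() counts rows, including rows where age is NA.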
If I'm working with a dataset and I want to group the data (i.e. by country), compute a summary statistic (mean()) and then ungroup() the data.frame to have a dataset with the original dimensions (country-year) and a new column that lists the mean for each country (repeated over n years), how would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
This creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
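The difference in row counts is easy to check on a small made-up data frame (pop_df is a hypothetical stand-in for gapminder):

```r
library(dplyr)

# Toy country-year data.
pop_df <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2005, 2000, 2005),
  pop     = c(10, 20, 100, 300)
)

# mutate() keeps every country-year row and repeats the group mean.
with_mean <- pop_df %>%
  group_by(country) %>%
  mutate(mn = mean(pop)) %>%
  ungroup()

# summarise() collapses to one row per country.
collapsed <- pop_df %>%
  group_by(country) %>%
  summarise(mn = mean(pop))

nrow(with_mean)
nrow(collapsed)
```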
Actually, ungroup() is not needed to compute the new column in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
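The column values are identical. The difference is that the first result is still grouped by country, so later verbs such as mutate() or summarise() keep operating per group, whereas the second result is ungrouped.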
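Whether a result is still grouped can be checked with is_grouped_df(), and the grouping shows up as soon as another verb is applied. A short sketch on made-up data (pop_df is a hypothetical stand-in for gapminder):

```r
library(dplyr)

pop_df <- data.frame(
  country = c("A", "A", "B", "B"),
  pop     = c(10, 20, 100, 300)
)

still_grouped <- pop_df %>%
  group_by(country) %>%
  mutate(mn = pop / mean(pop))

ungrouped <- still_grouped %>% ungroup()

# The computed column is the same either way.
all.equal(still_grouped$mn, ungrouped$mn)

# Only the first result carries grouping metadata.
is_grouped_df(still_grouped)
is_grouped_df(ungrouped)

# A later summarise() therefore behaves differently:
per_country <- summarise(still_grouped, total = sum(pop))  # one row per country
overall     <- summarise(ungrouped, total = sum(pop))      # a single row
```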
I would like, when summarizing after grouping, to count the number of a specific level of another factor.
In the working example below, I would like to count the number of "male" levels in each group. I've tried many things with count, tally and so on but cannot find a straightforward and neat way to do it.
df <- data.frame(Group=replicate(20, sample(c("A","B"), 1)),
Value=rnorm(20),
Factor=replicate(20, sample(c("male","female"), 1)))
df %>%
group_by(Group) %>%
summarize(Value = mean(Value),
n_male = ???)
Thanks for your help!
We can use sum on a logical vector i.e. Factor == "male". The TRUE/FALSE will be coerced to 1/0 to get the frequency of 'male' elements when we do the sum
df %>%
group_by(Group) %>%
summarise(Value = mean(Value),
n_male = sum(Factor=="male"))
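If Factor can contain missing values, sum() returns NA for that group unless na.rm = TRUE is set. A small sketch with fixed, made-up data so the result is reproducible (the values are assumptions, not the random data from the question):

```r
library(dplyr)

# Fixed toy data; group B contains a missing Factor value.
df <- data.frame(
  Group  = c("A", "A", "A", "B", "B"),
  Value  = c(1, 2, 3, 4, 6),
  Factor = c("male", "female", "male", "female", NA)
)

res <- df %>%
  group_by(Group) %>%
  summarise(Value  = mean(Value),
            # TRUE/FALSE are coerced to 1/0; na.rm = TRUE ignores the NA
            n_male = sum(Factor == "male", na.rm = TRUE))

res
```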