The other day I came across the following lines of R, and I don't understand what %>%, summarise(n=n()), and summarise(total=n()) mean. I do understand the group_by and ungroup methods, though.
Can someone help out? I couldn't find any documentation for this either.
library(dplyr)
net.multiplicity <- group_by(net, nodeid, epoch) %>% summarise(n=n()) %>%
ungroup() %>% group_by(n) %>% summarise(total=n())
This is from the dplyr package. n=n() means that a variable named n will be assigned the number of rows (think number of observations) in each group of the data being summarised.
%>% (the pipe operator, which dplyr re-exports from magrittr) is read as "and then"; it is a way of listing your functions sequentially rather than nesting them. So that command says: do the grouping, and then summarise the result by the number of rows in each group, and then ungroup that result, and then group the ungrouped data by n, and then summarise that by the total number of rows in each of the new groups.
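To make the "and then" reading concrete, here is a minimal sketch that runs the same pipeline on a small made-up data frame and compares it with the equivalent nested calls. The toy `net` data below is an assumption for illustration only; the real `net` from the question is not shown.

```r
library(dplyr)

# Hypothetical stand-in for `net`: one row per observation.
net <- data.frame(
  nodeid = c(1, 1, 2, 3, 3, 3),
  epoch  = c(1, 1, 1, 2, 2, 3)
)

# Piped form, as in the question: read top to bottom as "and then".
piped <- net %>%
  group_by(nodeid, epoch) %>%   # one group per (nodeid, epoch) pair
  summarise(n = n()) %>%        # n = number of rows in each pair
  ungroup() %>%
  group_by(n) %>%               # regroup by that row count
  summarise(total = n())        # total = number of pairs sharing each count

# The same steps as nested calls, which must be read inside-out.
nested <- summarise(
  group_by(ungroup(summarise(group_by(net, nodeid, epoch), n = n())), n),
  total = n()
)

print(piped)
print(nested)
```

Both forms return the same table; the pipe only changes the order in which the steps are written, not what they compute.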
I'm working with a modified version of the babynames dataset, which can be obtained by installing the babynames package and calling:
# to install the package
install.packages('babynames')
# to load the package
library(babynames)
# to get the one data frame of interest from the package
babynames <- babynames::babynames
# the modified data that I'm working with
babynames_prop_sexes <- babynames %>%
select(-prop, -year) %>%
group_by(name, sex) %>%
mutate(total_occurence = sum(n))
I need to sort out names that have more than 10000 occurrences for both sexes. How can I approach this? (Preferably by using dplyr but any method is welcomed.)
Thanks in advance for any help!
There might be a more elegant solution, but this should get you a list of names whose total count is greater than 10000 as both M and F.
For the method, I just kept going with dplyr verbs. After using filter to drop the name/sex combinations with a total of 10000 or fewer, I could then group_by the name and use tally(), knowing that n == 2 means the name survived the filter twice, once for M and once for F.
large_total_both_genders_same_name <- babynames %>%
group_by(name, sex) %>%
summarize(total = sum(n)) %>%
filter(total > 10000) %>%
arrange(name) %>%
group_by(name) %>%
tally() %>%
arrange(desc(n)) %>%
filter(n == 2) %>%
dplyr::select(name)
And if you want to filter your original file by that shortlist of names, you can filter with %in% against the table we created (a semi_join by name would do the same job). In this case, it wouldn't be obvious what you are looking at unless you also included the year column, which you removed.
original_babynames_shortened <- babynames_prop_sexes %>%
filter(name %in% large_total_both_genders_same_name$name)
But anyway, this is a common process. Create a summary table of some kind that is saved as its own 'intermediary' table, so to speak, then join that to your original, as a filter. Sometimes this can all be done in one go, but it's often easier, in my opinion to break this into two pieces.
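As a rough sketch of that two-step pattern, the example below builds the shortlist and then applies it with semi_join. The `toy` data frame and its counts are made up for illustration and merely stand in for babynames; `count(name)` is used here as shorthand for group_by(name) followed by tally().

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (counts are made up).
toy <- data.frame(
  name = c("Alex", "Alex", "Mary", "Mary", "John"),
  sex  = c("M", "F", "F", "M", "M"),
  n    = c(12000, 11000, 50000, 300, 40000)
)

# Step 1: intermediary table of names whose total exceeds 10000 for both sexes.
both <- toy %>%
  group_by(name, sex) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  filter(total > 10000) %>%
  count(name) %>%        # how many sexes passed the filter for each name
  filter(n == 2) %>%
  select(name)

# Step 2: use the intermediary table as a filter on the original rows.
shortlisted <- semi_join(toy, both, by = "name")

print(shortlisted)
```

semi_join keeps the columns of the left table only, so it behaves like the %in% filter while still working when the match involves more than one key column.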
I am working with 'flights' dataset from 'nycflights13' package in R.
I want to add a column which adds the total distance covered by each 'carrier' in 2013. I got the total distance covered by each carrier and have stored the value in a new variable.
We have 16 carriers, so how do I bind a row of 16 numbers to a data frame with many more rows?
carrier <- flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance))
How can I add the sum of these carrier distances as a new column in the flights dataset?
Thank you for your time and effort here.
PS. I tried running a for loop, but it doesn't work. I am new to programming.
Use mutate instead:
flights %>%
group_by(carrier) %>%
mutate(TotalDistance = sum(distance)) %>%
ungroup()-> carrier
We can also use left_join.
library(nycflights13)
data("flights")
library(dplyr)
flights %>%
left_join(flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance)), by='carrier')-> carrier
This will work even if you don't use arrange at the end.
I have the following code, where I don't pipe through the summarise
library(tidyverse)
library(nycflights13)
depArrDelay <- flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
group_by(dep_delay, arr_delay)
Now doing
cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9148028 which is the correct value for my calculation
Now I add the %>% summarise (...) as seen below
depArrDelay <- flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
group_by(dep_delay, arr_delay) %>% summarise(count=n())
Now doing: cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9260394
So now the correlation is altered. Why is this happening? From what I know, summarise should only throw away all other columns that are not mentioned, and not alter values. Have I missed something, and how can I prevent summarise from altering the correlation?
As already mentioned in the comments, summarise reduces the number of rows: after group_by(dep_delay, arr_delay) it returns one row per distinct pair, so each pair counts only once in cor no matter how often it occurred. If you need the count without changing the number of rows, you can use add_count.
library(nycflights13)
library(dplyr)
temp <- flights %>%
filter_at(vars(c(dep_delay, arr_delay, distance)), all_vars(!is.na(.))) %>%
add_count(dep_delay, arr_delay)
If you then check for correlation you get the same value as earlier.
cor(temp$dep_delay, temp$arr_delay)
#[1] 0.9148027589
If the data has many columns and you need only a few of them for your analysis, you can pick the relevant ones with select.
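The effect is easy to reproduce on a small example. The sketch below uses a made-up data frame `d` (not the flights data) in which one pair of values is repeated, and compares summarise with add_count.

```r
library(dplyr)

# Hypothetical toy data: the pair (1, 1) occurs three times.
d <- data.frame(
  x = c(1, 1, 1, 2, 3),
  y = c(1, 1, 1, 3, 2)
)

# summarise collapses duplicates: one row per distinct (x, y) pair.
collapsed <- d %>%
  group_by(x, y) %>%
  summarise(count = n(), .groups = "drop")

# add_count keeps every original row and appends the count as column n.
counted <- d %>% add_count(x, y)

nrow(collapsed)                # 3 rows
nrow(counted)                  # 5 rows

cor(d$x, d$y)                  # computed on all 5 rows
cor(counted$x, counted$y)      # same rows, so the same value
cor(collapsed$x, collapsed$y)  # computed on 3 rows, so a different value here
```

The repeated pair carries three times the weight in the original data but only once after summarise, which is why the two correlations differ.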
I have the following R code. Essentially, I am asking R to arrange the dataset based on postcode and paon, then group them by id, and finally keep only the last row within each group. However, R requires more than 3 hours to do this.
I am not sure what I am doing wrong with my code since there is no for loop here.
epc2 is a data frame with 324,368 rows.
epc3 <- epc2 %>%
arrange(postcode, paon) %>%
group_by(id) %>%
do(tail(., 1))
Thank you for any and all of your help.
How about:
mtcars %>%
arrange(cyl) %>%
group_by(cyl) %>%
slice(n())
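Applied to the shape of the original question, the same idea might look like the sketch below. The toy `epc2` is an assumption for illustration; only the column names id, postcode, and paon come from the question. slice(n()) selects the last row of each group directly, instead of calling tail once per group through do().

```r
library(dplyr)

# Hypothetical stand-in for epc2 (values are made up).
epc2 <- data.frame(
  id       = c(1, 1, 2, 2, 2),
  postcode = c("B", "A", "C", "A", "B"),
  paon     = c(1, 2, 3, 4, 5)
)

epc3 <- epc2 %>%
  arrange(postcode, paon) %>%  # sort first, so "last" means last in this order
  group_by(id) %>%
  slice(n()) %>%               # keep the last row of each id
  ungroup()

print(epc3)
```

In dplyr 1.0.0 and later, slice_tail(n = 1) is another way to write the same step.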
I am trying to find the country with the highest average age, but I also need to filter out countries with fewer than 5 entries in the data frame. I tried the following, but it does not work:
bil %>%
group_by(citizenship,age) %>%
mutate(n=count(citizenship), theMean=mean(age,na.rm=T)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
bil is the dataset. I am trying to count how many entries I have for each country, filter out countries with fewer than 5 entries, find the average age for each country, and then find the country with the highest average. I am confused about how to do both things at the same time. If I do one summarize at a time, I lose the rest of my data.
Perhaps this could help. Note that the parameter 'x' in count is a tbl/data.frame, not a column. So, instead of count, we group by 'citizenship' only (grouping by 'age' as well would make every group mean equal to that age), get the group size with n(), get the mean of 'age', and then do the filter.
bil %>%
group_by(citizenship) %>%
mutate(n = n()) %>%
mutate(theMean = mean(age, na.rm=TRUE)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
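If one row per country is enough, both statistics can also be computed in a single summarise call, which avoids the problem of losing one result while computing the other. This is only a sketch: the toy `bil` below is made up, and only the column names citizenship and age come from the question.

```r
library(dplyr)

# Hypothetical stand-in for `bil` (values are made up).
bil <- data.frame(
  citizenship = c(rep("A", 5), rep("B", 2), rep("C", 6)),
  age = c(50, 60, 70, 80, 90, 95, 99, 40, 50, 60, 70, 80, NA)
)

by_country <- bil %>%
  group_by(citizenship) %>%
  summarise(
    n       = n(),                      # entries per country
    theMean = mean(age, na.rm = TRUE)   # average age per country
  ) %>%
  filter(n >= 5) %>%                    # drop countries with fewer than 5 entries
  arrange(desc(theMean))

# The first row is the country with the highest average age.
top <- slice(by_country, 1)
print(top)
```

The mutate version keeps every original row with n and theMean attached, while this version returns a compact table with one row per country.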