sort tibble rows by descending count of NAs - r

Apologies if I have missed where this has been asked before; I couldn't find it. I am learning about tibbles and the verb arrange. Is there a more efficient way to arrange rows in descending order of total row NA count than what I have done below? I am using the nycflights13 dataset.
library(nycflights13)
library(tidyverse)
options(tibble.width = Inf)
flights$na_count <- flights %>% is.na %>% rowSums
arrange(flights, desc(na_count))
I checked the help centre and this appears to be on topic though I know working code is often a candidate for code review.

I'm not sure it is much more efficient, but you can skip the helper column and rewrite it as:
flights %>%
  arrange(desc(rowSums(is.na(.))))
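To see what the NA-count sort does on something smaller than flights, here is a minimal sketch. The tibble toy and its columns are made up for illustration; the na_count column is kept only so the ordering is visible.

```r
library(dplyr)

# Hypothetical toy data: row 1 has no NAs, row 2 has two, row 3 has one
toy <- tibble(
  x = c(1, NA, 3),
  y = c("a", NA, "c"),
  z = c(TRUE, FALSE, NA)
)

sorted <- toy %>%
  mutate(na_count = rowSums(is.na(.))) %>%  # `.` is the whole tibble passed in by the pipe
  arrange(desc(na_count))                   # rows with the most NAs come first

sorted
```

Dropping the mutate() line and putting rowSums(is.na(.)) directly inside desc() gives the same row order without the extra column.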


R mutate which.max by group

I'm running into an issue that I feel should be simple but cannot figure out. I have searched the board for comparable problems/questions but have been unable to find an answer.
In short, I have data from a variety of motor vehicles and want to know the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code
data <- data %>% group_by(Name) %>%
  mutate(speedATaccel = with(data, avg.Speed[which.max(top.Accel)]),
         accelATspeed = with(data, avg.Accel[which.max(top.Speed)]))
However, the group_by function doesn't appear to be working: it just provides the values across the whole dataset rather than for each individual vehicle group.
Any help would be appreciated.
Thanks,
Using with(data, ...) bypasses the grouping set by group_by(): the expression is evaluated against the whole data frame, so which.max() returns an index into the full dataset. Remove the with(data, ...) wrapper. Inside dplyr verbs there is no need for base R extraction such as $, [[ or with(); just refer to columns by their unquoted names.
library(dplyr)
data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)])
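As a quick check that the lookup really happens per group, here is a small sketch on invented data. The vehicles tibble and its values are hypothetical; only the column names follow the question. It uses summarise() instead of mutate() so there is one row per vehicle, which is an assumption about the desired output shape.

```r
library(dplyr)

# Hypothetical data: two vehicles, three observations each
vehicles <- tibble(
  Name      = c("A", "A", "A", "B", "B", "B"),
  avg.Speed = c(10, 20, 30, 40, 50, 60),
  top.Accel = c(1, 5, 2, 9, 3, 4),
  avg.Accel = c(0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
  top.Speed = c(15, 25, 35, 65, 55, 45)
)

per_vehicle <- vehicles %>%
  group_by(Name) %>%
  summarise(
    # which.max() is evaluated within each group, so the index is group-local
    speedATaccel = avg.Speed[which.max(top.Accel)],
    accelATspeed = avg.Accel[which.max(top.Speed)]
  )

per_vehicle
```

Keeping mutate() instead would repeat the same per-group value on every row of that vehicle.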

How to create a table showing the biggest values in a large csv file in R with dplyr? [duplicate]

I have a very big csv file and I'm trying to find the number of times each value is repeated in a column.
The csv file I'm using: https://www.kaggle.com/nyphil/perf-history
This is what I've been trying to do:
library(dplyr)
repeatedcomposers<-table(ny_philarmonic$composerName)
This works, but it only gives me 1000 values instead of the 2767 composers in the dataframe.
I also need the result as a separate dataframe so I can use it later.
The main dplyr verbs (e.g., mutate(), arrange(), etc.) always return dataframes. So if you are looking to do some kind of operation that results in a dataframe, you are correct that a dplyr-centric approach is probably a good place to start. Base R functions are often vector-centric, so something like table() will often require additional steps afterward if you want a dataframe in the end.
Once you've committed to dplyr, you have at least two options for this particular dilemma:
Option 1
The count() function gets you there in one step.
df %>%
  count(composerName) %>%
  arrange(-n) # to bring the highest count to the top
Option 2
Although it is one more line, I personally prefer the more verbose option because it helps me see what is happening more easily.
df %>%
  group_by(composerName) %>%
  summarise(n = n()) %>%
  arrange(-n) # to bring the highest count to the top
It has the added benefit that I can roll right into additional summarise() commands that I might care about too.
df %>%
  group_by(composerName) %>%
  summarise(
    n = n(),
    n_sq = n^2) # a little silly here, but often convenient in other contexts
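For a side-by-side look at the two options, here is a minimal sketch on invented data. The performances tibble is hypothetical and stands in for the real file; sort = TRUE is count()'s built-in way to put the largest counts first.

```r
library(dplyr)

# Hypothetical stand-in for the real data: one row per performance
performances <- tibble(
  composerName = c("Bach", "Mozart", "Bach", "Ravel", "Bach", "Mozart")
)

# Option 1: count() in one step, sorted by descending n
by_count <- performances %>%
  count(composerName, sort = TRUE)

# Option 2: the explicit group_by() + summarise() form
by_summarise <- performances %>%
  group_by(composerName) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

by_count
```

Both results are ordinary tibbles with one row per composer, so either can be stored and reused later.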
Consider data.table for large datasets
EDIT: I would be remiss if I failed to mention that data.table might be worth looking into for this larger dataset. Although dplyr is optimized for readability, it often slows down on datasets with more than 100k rows. In contrast, the data.table package is designed for speed with large datasets. If you are an R-focused person who often runs into large datasets, it's worth the time to look into. Here is a good comparison

In R, how to sum multiple columns based on value of another column?

In R, I have a dataframe with one variable holding the name of a country, a number of numeric variables (population, number of cars, etc.), and a final column that represents the region.
I would like to sum the variables (1, 2, ...) based on the value of the region column. I think this should be possible with dplyr and summarise_each, but I cannot get it to work.
Would someone be able to help me please? Thanks a lot.
Based on your description (although this may change if you can share some of your dataframe), something like this should work:
library(dplyr)
summarized_df <- df %>%
  group_by(region) %>%
  summarise(var1 = sum(variable1), var2 = sum(variable2), var3 = sum(variable3))
If this doesn't seem to work, maybe you can post your code and the errors even if you can't post the dataframe.
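If there are many numeric columns, naming each one gets tedious. A possible alternative, assuming dplyr 1.0.0 or later, is across() with where(is.numeric). The countries tibble and its column names below are invented for illustration.

```r
library(dplyr)

# Hypothetical data in the shape described in the question
countries <- tibble(
  country    = c("A", "B", "C", "D"),
  population = c(10, 20, 30, 40),
  cars       = c(1, 2, 3, 4),
  region     = c("North", "North", "South", "South")
)

by_region <- countries %>%
  group_by(region) %>%
  # sum every numeric column within each region; non-numeric columns are dropped
  summarise(across(where(is.numeric), sum))

by_region
```

If the real columns contain missing values, passing na.rm = TRUE through an anonymous function, e.g. across(where(is.numeric), ~ sum(.x, na.rm = TRUE)), would be the usual adjustment.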

Trying to understand dplyr function - group_by

I am trying to understand the way the group_by function works in dplyr. I am using the airquality data set that comes with the datasets package.
I understand that if I do the following, it should arrange the records in increasing order of the Temp variable
airquality_max1 <- airquality %>% arrange(Temp)
I see that is the case in airquality_max1. I now want to arrange the records by increasing order of Temp but grouped by Month. So the end result should first have all the records for Month == 5 in increasing order of Temp. Then it should have all records of Month == 6 in increasing order of Temp and so on, so I use the following command
airquality_max2 <- airquality %>% group_by(Month) %>% arrange(Temp)
However, what I find is that the results are still in increasing order of Temp only, not grouped by Month, i.e., airquality_max1 and airquality_max2 are equal.
I am not sure why the grouping by Month does not happen before the arrange function. Can anyone help me understand what I am doing wrong here?
More than the problem of trying to sort the data frame by columns, I am trying to understand the behavior of group_by as I am trying to use this to explain the application of group_by to someone.
arrange() ignores grouping; see the breaking changes in dplyr 0.5.0. If you need to order by two columns, you can do:
airquality %>% arrange(Month, Temp)
For a grouped data frame, you can also set the .by_group argument to TRUE to sort by the grouping variable first.
airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
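The three variants can be compared directly on airquality. This is only a sketch to make the behaviour visible; the object names are arbitrary.

```r
library(dplyr)

# Sort by Month, then Temp, with no grouping involved
two_keys <- airquality %>%
  arrange(Month, Temp)

# Grouped, with .by_group = TRUE: the grouping variable is used as the first sort key
by_group <- airquality %>%
  group_by(Month) %>%
  arrange(Temp, .by_group = TRUE) %>%
  ungroup()

# Grouped, without .by_group: arrange() sorts by Temp across the whole table
ignores_groups <- airquality %>%
  group_by(Month) %>%
  arrange(Temp) %>%
  ungroup()

# TRUE when both approaches put the rows in the same order
identical(two_keys$Temp, by_group$Temp)
```

In ignores_groups the Month column ends up interleaved, which is the behaviour described in the question.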

Multiple Response Questions using Dplyr and Tidyr

I'm struggling with multiple response questions in R. I'm hoping to find an easy way to tackle this with dplyr and tidyr. Below is a sample multiple response data frame. I'm trying to do two things. First, create percentages: % of cats, % of dogs, etc. Percentages will be of overall responses. My usual way of calculating percentages -
group_by(_)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
doesn't seem to cut it in this situation. Maybe I have to use summarise_each or a more specialized function? I'm still new to R and really new to dplyr and tidyr. I also tried tidyr's unite() function, which works, but it includes NAs, which I will have to recode away. And I still can't seem to calculate the percentages of the united column.
Any suggestions would be great! First, how to unite the multiple response columns using "unite" into all possible combinations and then calculating percentages of each, and also how to simply calculate the percentage of each binary column as a proportion of overall responses? Hope this makes sense! I'm sure there's a simple and elegant answer that I'm overlooking.
Cats<-c("Cat",NA,"Cat",NA,NA,NA,"Cat",NA)
Dogs<-c(NA,NA,"Dog","Dog",NA,"Dog",NA,"Dog")
Fish<-c(NA,NA,"Fish",NA,NA,NA,"Fish","Fish")
Pets<-data.frame(Cats,Dogs,Fish)
Pets<-Pets%>%unite(Combined,Cats,Dogs,Fish,sep=",",remove=FALSE)
Pets%>%group_by(Combined)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
Based on my understanding of your question, it sounds like what you're trying to do can be done with tidyr's gather() function instead of unite().
library(dplyr)
library(tidyr)
# run this on Pets before the unite() step, i.e. with only the Cats, Dogs and Fish columns
Pets %>%
  gather(animal, type, na.rm = TRUE) %>%
  group_by(animal) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count))
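In current tidyr, gather() is superseded by pivot_longer(), so the same reshaping could be sketched as below. This assumes tidyr 1.0.0 or later. The column names pct_of_selections and pct_of_respondents are invented here to separate two possible readings of "percentage of overall responses": share of all ticked boxes versus share of people surveyed.

```r
library(dplyr)
library(tidyr)

Pets <- data.frame(
  Cats = c("Cat", NA, "Cat", NA, NA, NA, "Cat", NA),
  Dogs = c(NA, NA, "Dog", "Dog", NA, "Dog", NA, "Dog"),
  Fish = c(NA, NA, "Fish", NA, NA, NA, "Fish", "Fish"),
  stringsAsFactors = FALSE
)

n_respondents <- nrow(Pets)  # 8 rows, including respondents who ticked nothing

pet_shares <- Pets %>%
  # one row per ticked box; NA cells are dropped
  pivot_longer(everything(), names_to = "animal", values_to = "type",
               values_drop_na = TRUE) %>%
  count(animal, name = "count") %>%
  mutate(
    pct_of_selections  = count / sum(count),     # denominator: all ticked boxes
    pct_of_respondents = count / n_respondents   # denominator: all respondents
  )

pet_shares
```

Which denominator is appropriate depends on how the survey result is meant to be reported; the second column does not sum to 1 because a respondent can tick several animals.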
