Counting grouped missing values in R

Sorry if this question is fairly simple.
I am new to R and I want to count, by group, the number of missing values in the column some_column (in my dataset they are replaced by 0 values), and then get the group that has the most 0 values. So I did this (using the dplyr package):
missing_data <- group_by(some_data,some_group, count=sum(some_column==0))
But what is weird is that the count column contains the same number throughout the dataset, as if the dataset were not grouped. Does anyone have an idea?
OK, I got it:
missing_data %>% group_by(some_group) %>% summarise(count=sum(some_column==0))

Keeping with dplyr verbs:
missing_data <- filter(some_data, some_column == 0) %>%
group_by(some_group) %>%
summarise(count = n()) %>%
arrange(desc(count))

Here is an example using the mtcars data frame:
count_zero<-function(x){
sum(x==0,na.rm=TRUE)
}
aggregate(mtcars,list(cyl=(mtcars$cyl)),count_zero)

Here is the final answer:
missing_data %>% group_by(some_group) %>% summarise(count=sum(some_column==0)) %>% arrange(desc(count))
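To make the difference between the approaches above concrete, here is a minimal sketch on made-up data. The data frame contents are hypothetical; only the column names some_group and some_column come from the question. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: 0 stands in for a missing value
some_data <- data.frame(
  some_group  = c("a", "a", "a", "b", "b", "c"),
  some_column = c(0, 0, 5, 0, 3, 7)
)

# Count zeros per group, largest count first
zero_counts <- some_data %>%
  group_by(some_group) %>%
  summarise(count = sum(some_column == 0)) %>%
  arrange(desc(count))

# The group with the most zeros is the first row after sorting
top_group <- as.character(zero_counts$some_group[1])

# The filter-then-count variant drops groups that have no zeros at all
zero_counts_filtered <- some_data %>%
  filter(some_column == 0) %>%
  group_by(some_group) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
```

Note that the summarise(sum(...)) version keeps group "c" with a count of 0, while the filter version omits it. Which one is preferable depends on whether zero-count groups should appear in the result.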

Related

R filter or subset for finding a specific repeat count for data.frame

I want to use filter (from dplyr) or subset to get a new data frame containing only the rows whose value in the selected column occurs exactly 2 times in the original data.frame.
I tried this:
df2 <-
df %>%
group_by(x) %>% mutate(duplicate = n()) %>%
filter(duplicate == 2)
and this
df2 <- subset(df,duplicated(x))
but neither option works.
In group_by, just use the unquoted column name. Also, we don't need to create a column with mutate before filtering; the group size can be computed on the fly inside filter:
library(dplyr)
df %>%
group_by(x) %>%
filter(n() ==2) %>%
ungroup
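As a rough illustration on made-up data (the values in df are hypothetical), the sketch below shows which rows survive. It also shows why subset(df, duplicated(x)) gives a different result: duplicated() flags only the second and later occurrences of a value, regardless of how often the value occurs in total.

```r
library(dplyr)

# Hypothetical data: "a" and "d" occur twice, "b" once, "c" three times
df <- data.frame(
  x = c("a", "a", "b", "c", "c", "c", "d", "d"),
  y = 1:8
)

# Keep only the rows whose x value occurs exactly twice
df2 <- df %>%
  group_by(x) %>%
  filter(n() == 2) %>%
  ungroup()

# For comparison: duplicated() marks repeats, not "occurs exactly twice"
dup_rows <- subset(df, duplicated(x))
```

Here df2 contains the four rows for "a" and "d", whereas dup_rows contains one "a" row, two "c" rows, and one "d" row.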

How to add a total distance column in 'flights' dataset? DPLYR, Group_by, Ungroup

I am working with 'flights' dataset from 'nycflights13' package in R.
I want to add a column containing the total distance covered by each 'carrier' in 2013. I have computed the total distance covered by each carrier and stored the values in a new variable.
There are 16 carriers, so how do I bind a set of 16 numbers to a data frame with many more rows?
carrier <- flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance))
How can I add the sum of these carrier distances as a new column in the flights dataset?
Thank you for your time and effort here.
PS. I tried running a for loop, but it doesn't work. I am new to programming.
Use mutate instead:
flights %>%
group_by(carrier) %>%
mutate(TotalDistance = sum(distance)) %>%
ungroup()-> carrier
We can also use left_join.
library(nycflights13)
data("flights")
library(dplyr)
flights %>%
left_join(flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance)), by='carrier')-> carrier
This will work even if you don't use arrange at the end.
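A small sketch on made-up data can confirm that the two approaches give the same column. The toy_flights data frame and its values are hypothetical stand-ins for flights; only the column names carrier and distance come from the question.

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- data.frame(
  carrier  = c("AA", "AA", "UA", "UA", "UA", "DL"),
  distance = c(100, 200, 50, 50, 50, 300)
)

# Approach 1: a grouped mutate keeps every row and repeats the group total
via_mutate <- toy_flights %>%
  group_by(carrier) %>%
  mutate(TotalDistance = sum(distance)) %>%
  ungroup()

# Approach 2: summarise to one row per carrier, then join the totals back
totals <- toy_flights %>%
  group_by(carrier) %>%
  summarise(TotalDistance = sum(distance))

via_join <- toy_flights %>%
  left_join(totals, by = "carrier")
```

Both results have one row per original flight. The grouped mutate is shorter, while the join can be handy when the per-carrier totals are also needed on their own.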

Using summarise alters the result of my computation

I have the following code, where I don't pipe through the summarise
library(tidyverse)
library(nycflights13)
depArrDelay <- flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
group_by(dep_delay, arr_delay)
Now doing
cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9148028 which is the correct value for my calculation
Now I add the %>% summarise (...) as seen below
depArrDelay <- flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
group_by(dep_delay, arr_delay) %>% summarise(count=n())
Now doing: cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9260394
So now the correlation is altered. Why is this happening? From what I know, summarise should only throw away all other columns that are not mentioned, and not alter values. Have I missed something, and how can I prevent summarise from altering the correlation?
As already mentioned in the comments, summarise reduces the number of rows. If you need the count without changing number of rows, you can use add_count.
library(nycflights13)
library(dplyr)
temp <- flights %>%
filter_at(vars(c(dep_delay, arr_delay, distance)), all_vars(!is.na(.))) %>%
add_count(dep_delay, arr_delay)
If you then check for correlation you get the same value as earlier.
cor(temp$dep_delay, temp$arr_delay)
#[1] 0.9148027589
If the data has many columns and you need only a few of them for your analysis, you can pick the relevant ones with select.
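The effect is easy to reproduce on a tiny made-up example. The toy data frame below is hypothetical, and the .groups argument assumes dplyr 1.0.0 or later. Collapsing repeated (dep, arr) pairs into one row each gives every distinct pair equal weight, so the correlation is computed on different data.

```r
library(dplyr)

# Hypothetical data: the pair (0, 0) occurs four times
toy <- data.frame(
  dep = c(0, 0, 0, 0, 1, 2),
  arr = c(0, 0, 0, 0, 2, 1)
)

# Correlation over all six rows
cor_all <- cor(toy$dep, toy$arr)

# summarise() keeps one row per distinct (dep, arr) pair
collapsed <- toy %>%
  group_by(dep, arr) %>%
  summarise(count = n(), .groups = "drop")
cor_collapsed <- cor(collapsed$dep, collapsed$arr)

# add_count() appends the pair count (column n) but keeps all six rows
with_count <- toy %>% add_count(dep, arr)
cor_with_count <- cor(with_count$dep, with_count$arr)
```

In this sketch cor_all and cor_with_count agree, while cor_collapsed differs because it is computed on three rows instead of six.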

Select last row within each group with dplyr is slow

I have the following R code. Essentially, I am asking R to arrange the dataset based on postcode and paon, then group them by id, and finally keep only the last row within each group. However, R requires more than 3 hours to do this.
I am not sure what I am doing wrong with my code since there is no for loop here.
epc2 is a vector with 324,368 rows.
epc3 <- epc2 %>%
arrange(postcode, paon) %>%
group_by(id) %>%
do(tail(., 1))
Thank you for any and all of your help.
How about:
mtcars %>%
arrange(cyl) %>%
group_by(cyl) %>%
slice(n())
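Applied to data shaped like the question's, the same idea might look as follows. The epc_toy data frame is a hypothetical stand-in for epc2, and slice_tail() assumes dplyr 1.0.0 or later. slice() works on row positions within each group, which avoids calling an arbitrary function once per group as do() does.

```r
library(dplyr)

# Hypothetical stand-in for epc2
epc_toy <- data.frame(
  id       = c(1, 1, 2, 2, 2, 3),
  postcode = c("B", "A", "C", "A", "B", "A"),
  paon     = c(1, 2, 3, 4, 5, 6)
)

# Sort, then keep the last row of each id group
last_rows <- epc_toy %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  slice(n()) %>%
  ungroup()

# Equivalent spelling with slice_tail()
last_rows_tail <- epc_toy %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  slice_tail(n = 1) %>%
  ungroup()
```

Both pipelines return one row per id. Whether this resolves the three-hour runtime on the real data has not been measured here.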

Count all the NA values in one column of a dataframe

I am looking to generate a table that takes a data frame, counts all the NA values in each column, and returns another data frame that displays those counts. I would prefer to use dplyr tools here. I've gotten this far:
library(dplyr)
airquality %>%
group_by(Month) %>%
summarise_each(funs(sum(. == 41, na.rm = TRUE)))
This returns a table that counts all the 41's. But if I modify it to NA's like so:
airquality %>%
group_by(Month) %>%
summarise_each(funs(sum(. == "NA")))
This doesn't produce the desired output (described above). Any thoughts on how I can generate a table that counts all the NA values in each column?
Thanks in advance
try this:
airquality %>% group_by(Month) %>% summarise_each(funs(sum(is.na(.))))
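The comparison . == "NA" fails because "NA" is an ordinary string, not the missing-value marker, and a comparison involving a real NA yields NA rather than TRUE; is.na() is the test intended for this. In recent dplyr versions summarise_each() and funs() are deprecated, so the following sketch uses across() instead, assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Count NA values in every non-grouping column, per month
na_counts <- airquality %>%
  group_by(Month) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Overall NA count per column, without grouping
na_totals <- airquality %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```

Inside a grouped summarise, everything() does not include the grouping column, so Month appears once as the key and is not counted.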
