Count all the NA values in one column of a dataframe - r

I'm looking to generate a table that takes a dataframe, counts all the NA values in each column, and returns another dataframe that displays those counts. I'd prefer to use dplyr tools here. I've gotten this far:
library(dplyr)
airquality %>%
group_by(Month) %>%
summarise_each(funs(sum(. == 41, na.rm = TRUE)))
This returns a table that counts all the 41's. But if I modify it to NA's like so:
airquality %>%
group_by(Month) %>%
summarise_each(funs(sum(. == "NA")))
This doesn't produce the desired output (described above). Any thoughts on how I can generate a table that counts all the NA values in each column?
Thanks in advance

Try this:
airquality %>% group_by(Month) %>% summarise_each(funs(sum(is.na(.))))
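As for why the attempt in the question fails: `. == "NA"` compares each value with the string "NA", and comparing a missing value with anything yields NA, so the sum is NA for columns that contain missing values and 0 for the others. `is.na()` is the test for missingness. Below is a minimal sketch of the same count in current dplyr syntax, assuming dplyr 1.0.0 or later (where `across()` is available and `summarise_each()`/`funs()` are deprecated); the name `na_counts` is only illustrative.

```r
library(dplyr)

# Count NA values per column within each Month
na_counts <- airquality %>%
  group_by(Month) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Without grouping, base R gives the per-column totals directly
total_na <- colSums(is.na(airquality))
```

`na_counts` has one row per month and one count column per remaining variable.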

Related

R filter or subset for finding a specific repeat count for data.frame

I want to use filter from dplyr (or subset) to get a new dataframe containing only the rows whose value in the selected column occurs exactly 2 times in the original data.frame.
I tried this:
df2 <-
df %>%
group_by(x) %>% mutate(duplicate = n()) %>%
filter(duplicate == 2)
and this
df2 <- subset(df,duplicated(x))
but neither option works.
In the group_by, just use the unquoted column name. Also, we don't need to create a column with mutate before filtering; the count can be computed on the fly inside filter:
library(dplyr)
df %>%
group_by(x) %>%
filter(n() == 2) %>%
ungroup()
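A small worked example may help here. The data frame below is hypothetical (the question does not show its data). It also shows why `subset(df, duplicated(x))` is not equivalent: `duplicated()` flags only the second and later occurrences of a value, so it drops the first occurrence and keeps values that occur three or more times.

```r
library(dplyr)

# Hypothetical data: "a" occurs twice, "b" once, "c" three times
df <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = 1:6)

# Keep only rows whose x value occurs exactly twice
df2 <- df %>%
  group_by(x) %>%
  filter(n() == 2) %>%
  ungroup()

# For comparison: duplicated() marks repeats, not "exactly two"
dup_rows <- subset(df, duplicated(x))
```

Here `df2` holds the two "a" rows, while `dup_rows` holds the second "a" row plus the second and third "c" rows.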

2 Numeric Values In A Dataframe Field In R

I have a dataset in R with a little under 100 columns.
Some of the columns hold numeric values written as expressions, such as 87+3 as opposed to 90.
I have been able to update each column with the following piece of code:
library(dplyr)
new_dataframe = dataframe %>%
rowwise() %>%
mutate(new_value = eval(parse(text = value)))
However, I would like to be able to update a list of 60 columns in a more efficient way than simply repeating this line for each column.
Can someone help me find a more efficient way?
We can use mutate_at:
library(dplyr)
dataframe %>%
rowwise() %>%
mutate_at(1:60, list(new_value = ~eval(parse(text = .))))
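A sketch of the same idea with `across()`, assuming dplyr 1.0.0 or later. The data frame contents, the column names `a` and `b`, and the helper `eval_text` are hypothetical stand-ins for the real 60 columns. Note that `eval(parse())` executes whatever the strings contain, so this is only reasonable on trusted data.

```r
library(dplyr)

# Hypothetical data: character columns holding arithmetic expressions
dataframe <- data.frame(
  a = c("87+3", "10"),
  b = c("1+1", "2*3"),
  stringsAsFactors = FALSE
)

# Evaluate each string as an R expression; vapply() insists on one number per string
eval_text <- function(x) {
  vapply(x, function(s) eval(parse(text = s)), numeric(1), USE.NAMES = FALSE)
}

# Overwrite the selected columns in place; a range such as 1:60 would also work here
result <- dataframe %>%
  mutate(across(c(a, b), eval_text))
```

Because the helper is vectorised over the column, `rowwise()` is not needed in this version.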

Select last row within each group with dplyr is slow

I have the following R code. Essentially, I am asking R to arrange the dataset based on postcode and paon, then group them by id, and finally keep only the last row within each group. However, R requires more than 3 hours to do this.
I am not sure what I am doing wrong with my code since there is no for loop here.
epc2 is a data frame with 324,368 rows.
epc3 <- epc2 %>%
arrange(postcode, paon) %>%
group_by(id) %>%
do(tail(., 1))
Thank you for any and all of your help.
How about:
mtcars %>%
arrange(cyl) %>%
group_by(cyl) %>%
slice(n())
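Applied to the shape of the data in the question, that would look roughly like the sketch below. The small `epc2` here is made up for illustration; only the column names come from the question. `do(tail(., 1))` calls `tail()` separately for every group, whereas `slice(n())` is handled by dplyr itself, which is typically much faster when there are many groups. With dplyr 1.0.0 or later, `slice_tail(n = 1)` is an equivalent spelling.

```r
library(dplyr)

# Made-up stand-in for the real epc2 data frame
epc2 <- data.frame(
  id       = c(1, 1, 2, 2, 2),
  postcode = c("B", "A", "C", "A", "B"),
  paon     = c(1, 2, 3, 4, 5)
)

# Sort, then keep the last row of each id group
epc3 <- epc2 %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  slice(n()) %>%
  ungroup()
```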

counting grouped missing values in R

Sorry if this question is fairly simple.
I am new to R and I want to count, by group, the number of missing values in the column some_column (in my dataset they are replaced by 0 values), and then get the group that has the most 0 values. So I did this (using the dplyr package):
missing_data <- group_by(some_data,some_group, count=sum(some_column==0))
But what is weird is that the count column holds the same number throughout the dataset, as if the dataset were not grouped. Does anyone have an idea?
OK, I got it:
missing_data %>% group_by(some_group) %>% summarise(count=sum(some_column==0))
Keeping with dplyr verbs:
missing_data <- filter(some_data, some_column == 0) %>%
group_by(some_group) %>%
summarise(count = n()) %>%
arrange(desc(count))
Here is an example using the mtcars dataframe:
count_zero<-function(x){
sum(x==0,na.rm=TRUE)
}
aggregate(mtcars,list(cyl=(mtcars$cyl)),count_zero)
Here's the final answer:
missing_data %>% group_by(some_group) %>% summarise(count=sum(some_column==0)) %>% arrange(desc(count))
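The reason the original attempt gave the same number everywhere is that an expression passed to `group_by()` is computed on the ungrouped data before grouping, so `sum(some_column == 0)` ran once over the whole dataset. A minimal sketch with hypothetical data (the values below are invented; only the column names come from the question):

```r
library(dplyr)

# Invented data: group "a" has one zero, group "b" has two
some_data <- data.frame(
  some_group  = c("a", "a", "b", "b", "b"),
  some_column = c(0, 1, 0, 0, 2)
)

# Count zeros per group, largest count first
counts <- some_data %>%
  group_by(some_group) %>%
  summarise(count = sum(some_column == 0)) %>%
  arrange(desc(count))

# The group with the most zeros is the first row
top_group <- counts$some_group[1]
```

If some_column can also contain real NA values, `sum(some_column == 0, na.rm = TRUE)` avoids an NA count for those groups.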

Standardize data columns in R in subgroups

I'm struggling with standardization of data columns in R in subgroups.
I created the data frame:
df<-data.frame(
salesPerson=sample(c('Alan','Bob','Cindy'),20 ,replace=TRUE)
, quater=sample(c('Q1','Q2','Q3'),20 ,replace=TRUE)
,salesValue=runif(20, 5.0, 7.5)
)
I would like to add an additional column to the data frame with scaled sales values.
To scale the whole column I can use:
df$salesValueScaled<-scale(df$salesValue)
The problem is that I would like to scale sales separately for each combination of the columns salesPerson and quater. Something like:
df$salesValueScaled<-scale(df$salesValue, by =c(df$salesPerson,df$quater))
I have been searching for this solution on this forum but I couldn't find a solution to this problem.
Thank you in advance for help.
You can use dplyr for this:
library(dplyr)
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
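A deterministic sketch of what this does, using a small hand-made data frame instead of the random one from the question (the values are invented; the column names are the question's). `scale()` returns a one-column matrix, so wrapping it in `as.numeric()` keeps the new column a plain numeric vector. A group with a single row has no spread to scale by, so its result comes back as a missing value.

```r
library(dplyr)

# Invented data: one group of three rows and one group of a single row
df <- data.frame(
  salesPerson = c("Alan", "Alan", "Alan", "Cindy"),
  quater      = c("Q1", "Q1", "Q1", "Q3"),
  salesValue  = c(5, 6, 7, 6)
)

# Scale within each salesPerson/quater combination
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  mutate(scaled_Col = as.numeric(scale(salesValue))) %>%
  ungroup()
```

The three Alan/Q1 rows are centred and scaled within their own group, and the single Cindy/Q3 row is reported as missing by `is.na()`.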
To work around groups that return NAs (groups with a single row, where the standard deviation is undefined), you can either keep the original values as they are or filter those groups out before scaling.
Keeping the original values (scaling only groups with more than one row):
new_df <- df %>% group_by(salesPerson, quater) %>%
# if/else rather than ifelse(): ifelse() with a length-one test returns only the first element
mutate(scaled_Col = if (n() > 1) as.numeric(scale(salesValue)) else salesValue) %>%
ungroup
Filtering them out (as suggested by @steveb):
new_df <- df %>% group_by(salesPerson, quater) %>%
filter(n() > 1) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
I hope this helps.
