Elegant way to report missing values in a data.frame - r

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:
for (Var in names(airquality)) {
  missing <- sum(is.na(airquality[, Var]))
  if (missing > 0) {
    print(c(Var, missing))
  }
}
Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.

Just use sapply
> sapply(airquality, function(x) sum(is.na(x)))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
You could also use apply or colSums on the matrix created by is.na()
> apply(is.na(airquality),2,sum)
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
> colSums(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
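To get only the variables that actually have missing values, and a data.frame back as the question asks, you could filter that vector; for example:
na_counts <- sapply(airquality, function(x) sum(is.na(x)))
data.frame(variable = names(na_counts[na_counts > 0]),
           n_missing = na_counts[na_counts > 0],
           row.names = NULL)
#   variable n_missing
# 1    Ozone        37
# 2  Solar.R         7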

My new favourite for (not too wide) data is the set of methods from the excellent naniar package. Not only do you get frequencies, but also patterns of missingness:
library(naniar)
library(UpSetR)
riskfactors %>%
  as_shadow_upset() %>%
  upset()
It's often useful to see where the missing values fall relative to the non-missing ones, which can be achieved with a scatter plot that includes the missings:
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()
Or for categorical variables:
gg_miss_fct(x = riskfactors, fct = marital)
These examples are from the package vignette, which lists other interesting visualizations.
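naniar also has a tabular summary; assuming a reasonably recent version of the package, miss_var_summary() returns the per-variable counts and percentages, sorted by missingness:
miss_var_summary(airquality)
# variable n_miss pct_miss
# Ozone        37    24.2
# Solar.R       7     4.58
# Wind          0     0
# ...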

We can use map_df with purrr.
library(mice)
library(purrr)
# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <int> <int> <int> <int>
# 1 37 7 0 0 0 0

summary(airquality)
already gives you this information
The VIM package also offers some nice missing-data plots for data.frames:
library("VIM")
aggr(airquality)

Another graphical alternative is the plot_missing function from the excellent DataExplorer package:
The docs also point out that you can save the results for additional analysis with missing_data <- plot_missing(data).

More succinctly: sum(is.na(x[1]))
That is:
x[1]     looks at the first column
is.na()  is TRUE where the value is NA
sum()    counts TRUE as 1 and FALSE as 0
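For example, with the airquality data used above:
sum(is.na(airquality[1]))   # 37 missing values in the first column (Ozone)
sum(is.na(airquality))      # 44 missing values in the whole data frame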

Another function that helps you look at missing data is df_status from the funModeling package:
library(funModeling)
iris.2 is the iris dataset with some NAs added; you can replace it with your own dataset.
df_status(iris.2)
This will give you the number and percentage of NAs in each column.
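For a reproducible example, iris.2 could be built along these lines (a quick sketch; the 10% missingness rate is arbitrary):
library(funModeling)
# iris.2: the iris data with roughly 10% of the numeric values set to NA (illustrative only)
iris.2 <- iris
set.seed(1)
iris.2[-5] <- lapply(iris.2[-5], function(x) {
  x[sample(length(x), size = 0.1 * length(x))] <- NA
  x
})
df_status(iris.2)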

For one more graphical solution, the visdat package offers vis_miss:
library(visdat)
vis_miss(airquality)
It's very similar to the Amelia output, with the small difference that it gives the percentage of missings out of the box.

I think the Amelia package does a nice job of handling missing data and also includes a map for visualizing the missing rows.
install.packages("Amelia")
library(Amelia)
missmap(airquality)
You can also run the following code, which returns a logical vector indicating which rows contain any NA:
row.has.na <- apply(training, 1, function(x){any(is.na(x))})
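From that logical vector you can then count or drop the affected rows, for example:
sum(row.has.na)                                # how many rows contain at least one NA
training.complete <- training[!row.has.na, ]   # keep only the complete rows
# base R equivalent: training[complete.cases(training), ]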

Another graphical and interactive option is the is.na10 function from the heatmaply package:
library(heatmaply)
heatmaply(is.na10(airquality), grid_gap = 1,
          showticklabels = c(TRUE, FALSE),
          k_col = 3, k_row = 3,
          margins = c(55, 30),
          colors = c("grey80", "grey20"))
It probably won't work well with large datasets, though.

A dplyr solution to get the count could be:
summarise_all(df, ~sum(is.na(.)))
Or to get a proportion:
summarise_all(df, ~ sum(is.na(.)) / nrow(df))
Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:
is_missing <- function(x) {
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}
# sample ugly data
df <- data.frame(a = c(NA, '1', ' ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))
# counts:
> summarise_all(df, ~ sum(is_missing(.)))
  a b c d e
1 3 1 3 0 2
# proportions:
> summarise_all(df, ~ sum(is_missing(.)) / nrow(df))
     a    b    c d   e
1 0.75 0.25 0.75 0 0.5

If you want to do it for a particular column, you can also use this:
length(which(is.na(airquality[1])==T))

The ExPanDaR package's prepare_missing_values_graph function can be used to explore panel data:
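A minimal call might look like this (just a sketch; the ts_id argument naming the time dimension is taken from the package documentation and may need adjusting for your data):
library(ExPanDaR)
# ts_id is assumed to identify the time dimension of the panel
prepare_missing_values_graph(airquality, ts_id = "Month")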

For piping you could write:
# Counts
df %>% is.na() %>% colSums()
# % of missing rounded to 2 decimals
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2))
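With dplyr 1.0 or later, summarise_all() is superseded by across(); a sketch of the equivalent count:
df %>% summarise(across(everything(), ~ sum(is.na(.x))))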

Related

Continuous data binning based on observation distribution/frequency to decide bin range r dplyr

I have scanned the internet for days without luck on this issue. Any suggestions would be highly appreciated (especially in tidyverse-friendly syntax)!
I have a tibble with approx. 4300 rows/obs and 320 columns. One column is my dependent variable, a continuous numeric column called "RR" (Response Ratios). My goal is to bin the RR values into 10 factor levels, for later use in machine-learning classification.
I have experimented with the cut() function with this code:
df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                             breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                             include.lowest = TRUE))
But I have no luck: my bins come out with equal n in each, which does not reflect the distribution of the data. I have studied https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b a bit, but I simply cannot achieve the same in R.
Here is the distribution of my RR data values grouped into classes (not what I want):
You can try hist() to get the breaks. It's for plotting histograms, but it also provides other associated data as a side effect. In the example below, the plot is suppressed with plot = FALSE to expose the breaks data. Then use those breaks in cut(). This should give you cutoffs that maintain the distribution of the variable.
hist(iris$Sepal.Length, breaks = 5, plot = FALSE)
# $breaks
# [1] 4 5 6 7 8
#
# $counts
# [1] 32 57 49 12
#
# ...<omitted>
breaks <- hist(iris$Sepal.Length, breaks = 5, plot = FALSE)$breaks
dat <- iris %>%
  mutate(sepal_length_group = cut(Sepal.Length, breaks = breaks))
dat %>%
  count(sepal_length_group)
#   sepal_length_group  n
# 1              (4,5] 32
# 2              (5,6] 57
# 3              (6,7] 49
# 4              (7,8] 12
Thank you!
I also experimented with cut() and then count(). I then use labels = FALSE to get integer labels that can be used in a new mutate() for a column with the character names of the interval groups.
numbers_of_bins <- 10
df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                             breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                             include.lowest = TRUE))
head(df$RR_MyQuantile, 10)
df %>%
  group_by(RR_MyQuantile) %>%
  count()
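For example, the labels = FALSE idea could look like this on the iris data from the answer above (a small sketch):
library(dplyr)
numbers_of_bins <- 10
breaks <- unique(quantile(iris$Sepal.Length, probs = seq.int(0, 1, by = 1 / numbers_of_bins)))
iris %>%
  mutate(bin_id    = cut(Sepal.Length, breaks = breaks, include.lowest = TRUE, labels = FALSE),
         bin_label = cut(Sepal.Length, breaks = breaks, include.lowest = TRUE)) %>%
  count(bin_id, bin_label)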

row index of "looked at" row case_when in R

I'm currently struggling with a coding task concerning the use of a case_when statement in R.
In general, I would like to use the row index of the row currently being evaluated by case_when in the assignment part.
A short explanation of the data: I have a large data.frame with a date column, a geo-layer column and some numeric columns with numbers for the calculations.
The data.frame isn't sorted, and not every geo layer is present at every point in time. Sadly I can't provide a real data set due to legal issues.
The task at hand is to compute, on the one hand, simple mathematical operations within the same point in time and, on the other hand, operations across different points in time for the same geo layer and numeric value.
The mathematical operations vary, as does the interval between the time points.
For instance, I need to calculate the rate of change of a value relative to the last quarter and the last year:
((current_value - last_quarter_value) / current_value)*100
This is how I'd like to code it.
library(tidyverse)
test_dataframe <- data.frame(
  times = c(rep(as.Date("2021-03-01"), 2), rep(as.Date("2020-12-01"), 2)),
  geo_layer = rep(c("001001001", "001001002"), 2),
  numeric_value_a = 1:4,
  numeric_value_b = 4:1,
  numeric_value_c = c(1, NA, 3, 1)
)
check_comparison_times <- unique(test_dataframe$times)
test_dataframe <- test_dataframe %>%
  mutate(
    normale_calculation = case_when(
      !is.na(numeric_value_c) ~ (numeric_value_a + numeric_value_b) / numeric_value_c,
      TRUE ~ Inf
    ),
    time_comparison = case_when(
      is.na(numeric_value_c) ~ Inf,
      (times - months(3)) %in% check_comparison_times ~ test_dataframe[
        which(
          test_dataframe[, "times"] ==
            (test_dataframe[row_index_of_current_looked_at_row, "times"] - months(3)) &
          test_dataframe[, "geo_layer"] ==
            test_dataframe[row_index_of_current_looked_at_row, "geo_layer"]
        ),
        "numeric_value_c"] - test_dataframe[row_index_of_current_looked_at_row, "numeric_value_c"],
      TRUE ~ -Inf
    )
  )
With this desired outcome:
times geo_layer numeric_value_a numeric_value_b numeric_value_c normal_calculation time_comparison
1 2021-03-01 001001001 1 4 1 5.000000 2
2 2021-03-01 001001002 2 3 NA Inf Inf
3 2020-12-01 001001001 3 2 3 1.666667 -Inf
4 2020-12-01 001001002 4 1 1 5.000000 -Inf
Currently I solve the problem with a triple loop in which I first pair the values for time, then for geo_layer, and then execute the mathematical operation.
Since my data set is much, much larger than this, that solution is very inefficient.
Thanks for your help.
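For reference, a minimal sketch of a join-based alternative to the triple loop, assuming the comparison is always exactly three months back:
library(dplyr)
library(lubridate)
# shift the reference rows forward by one quarter so they line up with the "current" rows
lagged <- test_dataframe %>%
  transmute(geo_layer,
            times = times %m+% months(3),
            last_quarter_value = numeric_value_c)
test_dataframe %>%
  left_join(lagged, by = c("times", "geo_layer")) %>%
  mutate(time_comparison = case_when(
    is.na(numeric_value_c)     ~ Inf,
    !is.na(last_quarter_value) ~ last_quarter_value - numeric_value_c,
    TRUE                       ~ -Inf
  ))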

How to group, pivot and count in R

Using the GermanCredit dataset from the caret library.
library("caret")
data(GermanCredit)
After filtering this down a bit
credit.all <- GermanCredit[,c(10, 1:9, 11:13, 16:19)]
attach(credit.all)
names(credit.all)
We have these names
[1] "Class" "Duration"
[3] "Amount" "InstallmentRatePercentage"
[5] "ResidenceDuration" "Age"
[7] "NumberExistingCredits" "NumberPeopleMaintenance"
[9] "Telephone" "ForeignWorker"
[11] "CheckingAccountStatus.lt.0" "CheckingAccountStatus.0.to.200"
[13] "CheckingAccountStatus.gt.200" "CreditHistory.ThisBank.AllPaid"
[15] "CreditHistory.PaidDuly" "CreditHistory.Delay"
[17] "CreditHistory.Critical"
What I need to do is pivot and summarize on two of these columns, something I know how to do in SQL like this:
SELECT
Class
, SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) AS Critical
, SUM(CASE WHEN `CreditHistory.Critical` = 0 THEN 1 ELSE 0 END) AS NotCritical
, SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) / COUNT(*) AS PctCritical
FROM `credit.all`
GROUP BY
Class
Which would produce something like this
However, I am struggling mightily to get a foothold in R. Using books and Google, it seems I should use reshape2's melt and dcast to achieve something like this. What I have tried are basically variants of this:
library(reshape2)
credit.melted <- melt(credit.all[,c(1,17)], ID=c("name", "Class"))
dcast(credit.melted, Class~CreditHistory.Critical, nrow, fill=0)
But all of my attempts with these functions have produced errors too cryptic and too generic for me to understand what I am doing wrong:
Error in vapply(indices, fun, .default) : values must be length 1,
but FUN(X[[1]]) result is length 0
Sometimes my random permutations of the function calls produce slightly different error output, but nothing that points me in the right direction.
Question: How can I do the pivoted summary similar to the SQL result using R?
I wouldn't really consider this a pivot. You're not trying to use a pivot command in SQL. You can use dplyr to follow the exact same method as your SQL:
library(dplyr)
credit.all %>%
  group_by(Class) %>%
  summarize(Critical = sum(CreditHistory.Critical == 1),
            NotCritical = sum(CreditHistory.Critical == 0),
            PctCritical = mean(CreditHistory.Critical == 1))
# # A tibble: 2 x 4
# Class Critical NotCritical PctCritical
# <fct> <int> <int> <dbl>
# 1 Bad 50 250 0.167
# 2 Good 243 457 0.347
Since it's a binary column the == 1 isn't really necessary, but I leave it in because (a) it's more similar to your SQL code, and (b) if there were other values but you wanted the count of 1s, this would be the way to do it. However, you could get the same results a little more simply like this:
credit.all %>%
  group_by(Class) %>%
  summarize(Critical = sum(CreditHistory.Critical),
            NotCritical = n() - Critical,
            PctCritical = Critical / n())
If you really want a pivot approach, we can go that route, it just seems less straightforward. Your data is already in a long format, so we don't need to melt, we can just cast:
pivot = dcast(Class ~ CreditHistory.Critical, data = credit.all)
pivot
# Using CreditHistory.Critical as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# Class 0 1
# 1 Bad 250 50
# 2 Good 457 243
You could then rename the columns and calculate the percentages:
names(pivot)[2:3] = c("NotCritical", "Critical")
pivot$PctCritical = with(pivot, Critical / (Critical + NotCritical))
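For completeness, the newer tidyr equivalent of that cast step would be count() followed by pivot_wider() (a sketch, not part of the original answer):
library(dplyr)
library(tidyr)
credit.all %>%
  count(Class, CreditHistory.Critical) %>%
  pivot_wider(names_from = CreditHistory.Critical, values_from = n,
              names_prefix = "Critical_")
# # A tibble: 2 x 3
# Class Critical_0 Critical_1
# Bad          250         50
# Good         457        243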

Simulating a timeseries in dplyr instead of using a for loop

So, while lag and lead in dplyr are great, I want to simulate a timeseries of something like population growth. My old school code would look something like:
tdf <- data.frame(time=1:5, pop=50)
for (i in 2:5) {
  tdf$pop[i] = 1.1 * tdf$pop[i - 1]
}
which produces
time pop
1 1 50.000
2 2 55.000
3 3 60.500
4 4 66.550
5 5 73.205
I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).
But, something like
tdf <- data.frame(time=1:5, pop=50) %>%
mutate(pop = 1.1*lag(pop))
which would have been my first guess just produces
time pop
1 1 NA
2 2 55
3 3 55
4 4 55
5 5 55
I feel like I'm missing something obvious.... what is it?
Note: this is a trivial example; my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so the tidyverse is proving to be a powerful tool in bringing my simulations together.
Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:
data.frame(time = 1:5, pop = 50) %>%
  mutate(pop = Reduce(function(x, y) {x * 1.1}, pop, accumulate = TRUE))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
or with purrr,
data.frame(time = 1:5, pop = 50) %>%
  mutate(pop = accumulate(pop, ~ .x * 1.1))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
If the starting value of pop is, say, 50, then pop = 50 * 1.1^(0:4) gives the whole series (the start plus the next four values). With your code, you could do:
data.frame(time = 1:5, pop = 50) %>%
  mutate(pop = pop * 1.1^(1:n() - 1))
Or,
base = 50
data.frame(time=1:5) %>%
mutate(pop = base * 1.1^(1:n()-1))
Purrr's accumulate function can handle time-varying indices if you pass them to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick is that accumulate() can work on list columns as well as vector columns. You can use the tidyr function nest() to group columns into a list column containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this approach to simulate multiple replicates for a given model using dplyr + purrr + tidyr.
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
# Declare the population growth function. Note: the first two arguments
# have to be .x (the prior vector of populations and parameters) and .y,
# the current parameter value and population vector.
# This example function is a Ricker population growth model.
logistic_growth = function(.x, .y, growth, comp) {
pop = .x$pop[1]
growth = .y$growth[1]
comp = .y$comp[1]
# Note: this uses the state from .x, and the parameter values from .y.
# The first observation will use the first entry in the vector for .x and .y
new_pop = pop*exp(growth - pop*comp)
.y$pop[1] = new_pop
return(.y)
}
# Starting parameters the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps = 100
pop_init = 1
growth = 0.5
comp = 0.05
#First test: fixed growth rates
test1 = data_frame(time = 1:n_steps,pop = pop_init,
growth=growth,comp =comp)
# here, the combination of nest() and group_by() split the data into individual
# time points and then groups all parameters into a new vector called state.
# ungroup() removes the grouping structure, then accumulate runs the function
#on the vector of states. Finally unnest transforms it all back to a
#data frame
out1 = test1 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = data_frame(time = 1:n_steps,pop = pop_init,
growth=rnorm(n_steps, growth,0.1),comp=comp)
out2 = test2 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This demostrates how to use this approach to simulate replicates using dplyr
# Note the crossing function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps,pop = pop_init, comp=comp) %>%
mutate(growth=rnorm(n_steps*10, growth,0.1))
out3 = test3 %>%
group_by(rep)%>%
group_by(rep,time)%>%
nest(pop, growth, comp,.key = state)%>%
group_by(rep)%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
print(qplot(time, pop, data=out1)+
geom_line() +
geom_point(data= out2, col="red")+
geom_line(data=out2, col="red")+
geom_point(data=out3, col="red", alpha=0.1)+
geom_line(data=out3, col="red", alpha=0.1,aes(group=rep)))
The problem here is that dplyr is running this as a set of vector operations rather than evaluating the terms one at a time. Here, 1.1*lag(pop) is interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop = 50, the lagged values for all the steps were 50.
dplyr does have some helper functions for sequential evaluation; the standard functions cumsum, cumprod, etc., plus a few newer ones (see ?cummean), all work within dplyr. In your example, you could simulate the model with:
tdf <- data.frame(time = 1:5, pop = 50, growth_rate = c(1, rep(1.1, times = 4))) %>%
  mutate(pop = pop * cumprod(growth_rate))
  time     pop growth_rate
1    1  50.000         1.0
2    2  55.000         1.1
3    3  60.500         1.1
4    4  66.550         1.1
5    5  73.205         1.1
Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:
tdf <- data.frame(time = 1:5, pop = 50, growth_rate = 1.1) %>%
  mutate(pop = pop * cumprod(c(1, growth_rate[-1])))
This makes it explicit that the growth rate column refers to the rate of growth in the current time step from the previous one.
There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.
What about the map functions, i.e.:
library(purrr)
tdf <- data_frame(time = 1:5)
tdf %>% mutate(pop = map_dbl(.x = time, .f = function(x) 50 * 1.1^(x - 1)))

Building dummy variable with many conditions (R)

My dataset looks something like this
ID  YOB  ATT94  GRADE94  ATT96  GRADE96  ATT98  .....
 1  1975     1       12      0       NA
 2  1985     1        3      1        5
 3  1977     0       NA      0       NA
 4  ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that equals 1 if an individual is attending school when they are 19/20 years old, e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!) so I struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong. I changed your code a bit, just to add another column to the long-format data table. Here is what I did in the end:
df %>%
  melt(id = c("ID", "DOB")) %>%
  tbl_df() %>%
  mutate(dummy = ifelse(value - DOB %in% c(19, 20), 1, 0))
So it looks something like this:
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column but when the condition does not hold it will be equal to NA
As @Greg Snow mentions, this approach assumes that the column has already been created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, when you ask questions it's useful to provide code that other people can run in order to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package does exactly that conversion. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)

# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB   <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()

# Recode ATTx variables with the actual year
df$ATT94[df$ATT94 == 1] <- 1994
df$ATT96[df$ATT96 == 1] <- 1996
df$ATT98[df$ATT98 == 1] <- 1998

# Melt the data into a long format and perform the requested analysis
df %>%
  melt(id = "DOB") %>%
  tbl_df() %>%
  mutate(dummy = ifelse(value - DOB %in% c(19, 20), 1, 0))
@Warner shows a way to create the variable (or at least the 1's; the assumption is that the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)
