Using `dplyr::na_if` with a probability to create missing data? - r

I'm interested in simulating data with a chance of missingness. How can I do this using dplyr::na_if?
Intuitively I wanted to do something like:
mtcars %>%
mutate(mpg = na_if(mpg, rbinom(n = n(),
1,
prob = .5) == 1))
But I think this is wrong because na_if is really for matching x and y. How do I use na_if to create a probability of missingness?
(edit: Also if there is a better function for creating missing data in the tidyverse please let me know in the comments)

You don't need na_if here; if_else is enough. rbinom is also overkill, since runif works fine.
mtcars %>%
mutate(mpg = if_else(runif(n = n()) > 0.5, NA_real_, mpg))
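A quick way to sanity-check this approach is to fix the seed and compare the share of NAs with the target probability. The sketch below wraps the same runif comparison in a small helper; the name add_missing and its prob argument are made up for illustration and are not part of dplyr.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): set each element of x to NA
# independently with probability `prob`.
add_missing <- function(x, prob) {
  x[runif(length(x)) < prob] <- NA
  x
}

set.seed(42)  # make the simulated missingness reproducible
sim <- mtcars %>%
  mutate(mpg = add_missing(mpg, prob = 0.5))

# Observed share of missing values; it fluctuates around 0.5
mean(is.na(sim$mpg))
```

With only 32 rows the observed share can be noticeably off from 0.5, so treat prob as a per-element probability rather than an exact proportion.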

With a slight modification of your code:
mtcars %>%
mutate(mpg = if_else(rbinom(n(), 1, prob = 0.5) == 1, NA_real_, mpg))
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 NA 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 NA 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 NA 6 167.6 123 3.92 3.440 18.30 1 0 4 4
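For contrast, na_if is meant for turning a specific value into NA, which is why it does not fit the random case. A minimal sketch, using a made-up vector in which -99 is assumed to be a missing-value code:

```r
library(dplyr)

# Made-up data: -99 is assumed to be a sentinel for "missing"
x <- c(21.0, 22.8, -99, 18.7, -99)

# na_if() replaces the values of x that are equal to y with NA
cleaned <- na_if(x, -99)
cleaned
```

The third and fifth elements become NA; every other value is left untouched.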

Related

Using dplyr, how should I create a column of strings repeating a character based on the value of another column?

With mtcars for example, I'd like to create a new column carb_dots such that when carb = 4, carb_dots = "...."
Using dplyr, I've tried
library(dplyr)
mtcars2 <- mtcars %>% mutate(carb_dots = rep(".", carb))
This errors with
Error in mutate_impl(.data, dots) :
Evaluation error: invalid 'times' argument.
What should I do? Thanks for your suggestions.
With the addition of stringr, you can do:
library(dplyr)
library(stringr)
mtcars %>%
mutate(carb_dots = str_dup(".", carb))
mpg cyl disp hp drat wt qsec vs am gear carb carb_dots
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ....
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ....
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 .
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 .
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ..
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 .
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ....
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ..
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ..
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ....
We can use strrep
library(dplyr)
mtcars %>%
mutate(carb_dots = strrep(".", carb))
# mpg cyl disp hp drat wt qsec vs am gear carb carb_dots
#Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ....
#Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ....
#Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 .
#Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 .
#Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ..
#Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 .
#Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ....
#...
If we need to use rep
mtcars %>%
rowwise %>%
mutate(carb_dots = paste(rep(".", carb), collapse=""))
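The reason the original attempt fails is that rep, given a single string, expects a single times value, whereas strrep is vectorised over times. A small base R sketch on a made-up carb vector shows the difference:

```r
carb <- c(4, 1, 2)

# strrep() is vectorised over `times`: one string per element of carb
dots_strrep <- strrep(".", carb)

# rep() with a single x needs a single `times`, so call it per element
dots_rep <- vapply(
  carb,
  function(k) paste(rep(".", k), collapse = ""),
  character(1)
)

dots_strrep  # "...." "."    ".."
dots_rep     # "...." "."    ".."
```

The vapply version is what rowwise does implicitly in the dplyr answer above: it evaluates the expression once per row.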

mutate with case_when - multiple LHS/RHS OR evaluations

I'm not sure of the best way to ask this question.
I would like to mutate using case_when (or if_else if that works better) to examine if a value exists in any of a range of columns.
E.g. in mtcars I would like to check if any of the columns vs, am, gear or carb contained 1 or 2 and set a new variable newVar to 1 if they do. I could do the following:
mtcars %>%
mutate(newVar = case_when(vs %in% c(1, 2) | am %in% c(1, 2) | gear %in% c(1, 2) | carb %in% c(1, 2) ~ 1,
TRUE ~ 0))
Is there a prettier way to do this? I want to check across 10+ columns so it gets long. Something like:
mtcars %>%
mutate(newVar = case_when(c(vs, am, gear, carb) %in% c(1, 2) ~ 1,
TRUE ~ 0))
I think base R works well here. Select the columns you want to check and take the row-wise sum of the logical matrix to calculate newVar.
df <- mtcars
cols <- c("vs", "am", "gear", "carb")
df$newVar <- +(rowSums(df[cols] == 1 | df[cols] == 2) > 0)
df
# mpg cyl disp hp drat wt qsec vs am gear carb newVar
#Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1
#Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
#Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1
#Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 0
#Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1
#Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 1
#Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 1
#....
We can also use apply for row-wise manipulation
df$newVar <- +(apply(df[cols] == 1 | df[cols] == 2, 1, any))
We can use tidyverse option to create the column
library(dplyr)
library(purrr)
mtcars %>%
mutate(newVar = select(., vs:carb) %>%
map(~ .x %in% 1:2) %>%
reduce(`|`) %>%
as.integer)
# mpg cyl disp hp drat wt qsec vs am gear carb newVar
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 0
#8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1
# ...
Or with base R
nm1 <- c("vs", "am", "gear", "carb")
mtcars$newVar <- +(Reduce(`|`, lapply(mtcars[nm1], `%in%`, 1:2)))
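Another option, assuming dplyr 1.0.4 or later is installed, is if_any, which applies one predicate to a selection of columns and combines the results with OR. This is a sketch rather than one of the original answers; it cross-checks the result against the rowSums version.

```r
library(dplyr)

# if_any() is TRUE for a row when the predicate holds in at least one
# of the selected columns
res <- mtcars %>%
  mutate(newVar = as.integer(if_any(c(vs, am, gear, carb), ~ .x %in% c(1, 2))))

# Cross-check against the base R version
cols <- c("vs", "am", "gear", "carb")
expected <- +(rowSums(mtcars[cols] == 1 | mtcars[cols] == 2) > 0)
all(res$newVar == expected)
```

For 10+ columns, the selection can be any tidyselect expression, for example a range such as vs:carb.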

Groupwise mathematical operation with iteration over sequence that has not the same length

I would like to create a new variable that is the addition of carb and the ith element of sequ based on cyl.
I think it might be some group_by operation, but I can't figure out how to iterate through sequ.
test_dataset <- mtcars[1:10,]
sequ <- seq(0.5, 0.7, 0.1)
arrange(test_dataset, cyl)
The resulting variable would be
c(1.5, 2.5, 2.5, 4.6, 4.6, 1.6, 1.6, 4.6, 2.7, 4.7)
If you want to create the data as a new column, you can do it like this:
library(dplyr)
test_dataset <- mtcars[1:10,]
sequ <- seq(0.5, 0.7, 0.1)
arrange(test_dataset, cyl) %>%
mutate(x = carb + sequ[match(cyl, unique(cyl))])
# mpg cyl disp hp drat wt qsec vs am gear carb x
# 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1.5
# 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 2.5
# 3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 2.5
# 4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 4.6
# 5 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 4.6
# 6 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1.6
# 7 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1.6
# 8 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4.6
# 9 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.7
# 10 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 4.7
So here we use match to get the right element of sequ.
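To see what match does here, it helps to look at the index it builds on a short made-up cyl vector:

```r
sequ <- seq(0.5, 0.7, 0.1)
cyl  <- c(4, 4, 6, 6, 8)   # already sorted, as after arrange()

# Position of each value among the distinct values,
# in order of first appearance
idx <- match(cyl, unique(cyl))

idx        # 1 1 2 2 3
sequ[idx]  # 0.5 0.5 0.6 0.6 0.7
```

Each group gets one position, and that position picks the matching element of sequ.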
You may use within: convert cyl to a factor, then to numeric, and use the resulting integer codes to index into sequ.
sequ <- seq(0.5, 0.7, 0.1)
within(mtcars[1:10,][order(mtcars[1:10,]$cyl), ], {
new=carb + sequ[as.numeric(as.factor(cyl))]})
# mpg cyl disp hp drat wt qsec vs am gear carb new
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1.5
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 2.5
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 2.5
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 4.6
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 4.6
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1.6
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1.6
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4.6
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.7
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 4.7
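One thing to keep in mind: the match/unique index and the factor codes agree only when the data are already sorted by cyl. A short sketch on a made-up, unsorted vector:

```r
cyl <- c(6, 4, 8, 4)   # not sorted

by_appearance <- match(cyl, unique(cyl))      # order of first appearance
by_level      <- as.numeric(as.factor(cyl))   # order of sorted levels

by_appearance  # 1 2 3 2
by_level       # 2 1 3 1
```

If the smallest cyl should always map to the first element of sequ, the factor version (or sorting first) is the safer choice.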

Adding row in dplyr across a selected number of columns

Within a dplyr workflow, I would like to append a row that fills only a selected number of columns.
Desired results
Starting with the mtcars data and applying function(s) with the goal of adding the string "A" to columns 2:5, one should arrive at the following result:
mpg cyl disp hp drat wt qsec vs am gear carb
NA A A A A NA NA NA NA NA NA
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
The following criteria were met:
Columns whose index appears in the vars() call receive the string "A"
All remaining columns receive NA
Approach
require(dplyr)
mtcars %>%
mutate_at(.cols = vars(2:5),
.funs = add_row(. = "A", .before = 1))
Naturally, this results in an error message:
Error: Unsupported index type: NULL
Hence my question: how can I utilise add_row, or a similar approach, to force value across a set of columns initially passed via vars()?
Side notes
I don't mind doing this via rbind but I would like to keep my %>% workflow:
%>% - receive object
Add something across first row to columns x:y %>%
Add something across first row to columns m:n %>%
Other manipulations
Add the row then update:
mtcars %>%
head %>%
add_row(.before = 1) %>%
mutate_at(.vars = vars(2:5),
.funs = ~ ifelse(is.na(.), "A", .))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 NA A A A A NA NA NA NA NA NA
# 2 21.0 6 160 110 3.9 2.620 16.46 0 1 4 4
# 3 21.0 6 160 110 3.9 2.875 17.02 0 1 4 4
# 4 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# 5 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# 6 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# 7 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: This will replace every NA in columns 2:5 with "A", not only those in the newly added row.
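To touch only the newly added row rather than every NA in those columns, one possibility is to condition on the row position instead of on is.na. This is a sketch, assuming dplyr 1.0 or later (for across) and using a small made-up tibble in place of mtcars:

```r
library(dplyr)
library(tibble)

# Small made-up tibble standing in for mtcars
df <- tibble(mpg = c(21, 22.8), cyl = c(6, 4), disp = c(160, 108))

res <- df %>%
  add_row(.before = 1) %>%   # new first row, all NA
  # fill only row 1 of columns 2:3; the columns become character
  mutate(across(2:3, ~ ifelse(row_number() == 1, "A", .x)))

res
```

As in the answer above, the affected columns are converted to character, because a column cannot hold both "A" and numbers.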

How to have NA's displayed first using arrange()

Sample data:
temp = data.frame(col = c(NA, 1, 2, 3))
Using arrange:
temp %>%
arrange(col)
gives
col
1 1
2 2
3 3
4 NA
and
temp %>%
arrange(desc(col))
gives
col
1 3
2 2
3 1
4 NA
I would like
col
1 NA
2 3
3 2
4 1
that is, to put NAs first. Does anyone know how to do this?
You could also do:
m %>%
arrange(!is.na(wt), wt) ##Spacedman's dataset
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 18.7 8 360.0 175 3.15 NA 17.02 0 0 3 2
#2 24.4 4 146.7 62 3.69 NA 20.00 1 0 4 2
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#5 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#6 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#7 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#8 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#9 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#10 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
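The same trick also covers the order asked for in the question (NA first, then descending). A minimal sketch on the question's sample data:

```r
library(dplyr)

temp <- data.frame(col = c(NA, 1, 2, 3))

# FALSE sorts before TRUE, so rows where col is NA come first;
# desc(col) then orders the remaining rows from largest to smallest
res <- temp %>%
  arrange(!is.na(col), desc(col))

res$col  # NA 3 2 1
```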
Write a function that sorts a data frame and then pass the handy na.last=FALSE option to order. My original version can be found in the edit history, David Arenburg improved it to this:
> sortNA=function(d,n,...){d[order(d[[deparse(substitute(n))]],...),]}
Then use like this
> m=mtcars[1:10,]
> m$wt[5]=NA
> m$wt[8]=NA
> m %>% sortNA(wt, na.last=FALSE)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 NA 17.02 0 0 3 2
Merc 240D 24.4 4 146.7 62 3.69 NA 20.00 1 0 4 2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Add decreasing=TRUE to sort in the opposite order.
You might also consider posting an issue to the dplyr github issue tracker to suggest a new option to the arrange function to do this.
The order function in base R has an na.last argument:
> temp=data.frame(col=c(NA,1,2,3))
> temp[order(temp[,"col"],na.last=F),]
[1] NA 1 2 3
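Combining na.last with decreasing gives the order from the question. Note also that indexing a one-column data frame by row drops it to a plain vector, which is why the output above is not a data frame; drop = FALSE keeps the structure. A short base R sketch:

```r
x <- c(NA, 1, 2, 3)

asc_na_first  <- x[order(x, na.last = FALSE)]
desc_na_first <- x[order(x, decreasing = TRUE, na.last = FALSE)]

asc_na_first   # NA 1 2 3
desc_na_first  # NA 3 2 1

# Keep the data frame structure when there is only one column
temp <- data.frame(col = x)
sorted <- temp[order(temp$col, decreasing = TRUE, na.last = FALSE), , drop = FALSE]
sorted
```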
