I have to calculate the number of missing values per observation in a data set. As there are several variables across multiple time periods, I thought it best to try a function to keep my syntax clean. The first part of looking up the number of missing values works fine:
data$NMISS <- data %>%
select('x1':'x4') %>%
apply(1, function(x) sum(is.na(x)))
But when I try to turn it into a function I get "Error in select():! NA/NaN argument"
library(dplyr)
library(tidyverse)
data <- data.frame(x1 = c(NA, 1, 5, 1),
x2 = c(7, 1, 1, 5),
x3 = c(9, NA, 4, 9),
x4 = c(3, 4, 1, 2))
NMISSfunc <- function (dataFrame,variables) {
dataFrame %>% select(variables) %>%
apply(1, function(x) sum(is.na(x)))
}
data$NMISS2 <- NMISSfunc(data,'x1':'x4')
I think it doesn't like the : in the range as it will accept c('x1','x2','x3','x4') instead of 'x1':'x4'
Some of the ranges are over twenty columns so listing them doesn't really provide a solution to keep the syntax neat.
Any suggestions?
You are right that you can't use "x1":"x4" here, as this isn't a valid use of the : operator once it is passed through an ordinary function argument. To get this to work in tidyverse style, your variables argument needs to be forwarded to select without being evaluated first. Fortunately, the tidyverse has the curly-curly notation {{variables}} for handling exactly this situation:
NMISSfunc <- function (dataFrame, variables) {
dataFrame %>%
select({{variables}}) %>%
apply(1, function(x) sum(is.na(x)))
}
Now we can use x1:x4 (without quotes) and the function works as expected:
NMISSfunc(data, x1:x4)
#> [1] 1 1 0 0
Created on 2022-12-13 with reprex v2.0.2
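As a rough sketch of why the curly-curly version is convenient: because the argument is forwarded to select() as written, the same function should also accept tidyselect helpers, not only bare ranges. The name count_missing is made up for this illustration, and rowSums(is.na(...)) is swapped in for the apply() call.

```r
library(dplyr)

data <- data.frame(x1 = c(NA, 1, 5, 1),
                   x2 = c(7, 1, 1, 5),
                   x3 = c(9, NA, 4, 9),
                   x4 = c(3, 4, 1, 2))

# count_missing is a hypothetical name; {{ }} forwards the selection to select()
count_missing <- function(dataFrame, variables) {
  dataFrame %>%
    select({{ variables }}) %>%
    is.na() %>%      # logical matrix: TRUE where a value is missing
    rowSums()        # number of TRUEs per row
}

by_range  <- count_missing(data, x1:x4)                   # bare column range
by_helper <- count_missing(data, starts_with("x"))        # tidyselect helper
by_names  <- count_missing(data, all_of(c("x1", "x3")))   # character vector of names
```

For ranges of twenty or more columns, the bare range or a helper such as starts_with() keeps the call short without listing every name.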
Why not simply,
data %>%
mutate(NMISS = rowSums(is.na(select(., x1:x4))))
x1 x2 x3 x4 NMISS
1 NA 7 9 3 1
2 1 1 NA 4 1
3 5 1 4 1 0
4 1 5 9 2 0
I am building a tidy-compatible function for use inside dplyr's mutate where I'd like to pass a variable and also the data set I'm working with, and use information from both to build a vector.
As a basic example, imagine I want to return a string containing the mean of the variable and the number of rows in the data set (I know I could just take the length of var, ignore that, it's an example).
library(tidyverse)
library(rlang)
info <- function(var,df = get(".",envir = parent.frame())) {
paste(mean(var),nrow(df),sep=', ')
}
dat <- data.frame(a = 1:10, i = c(rep(1,5),rep(2,5)))
#Works fine, 'types' contains '5.5, 10'
dat %>% mutate(types = info(a))
Ok, great so far. But now maybe I want it to work with grouped data. var will be from just one group, but . would be the full data set. So instead I'll use rlang's .data pronoun, which is just the data being worked with.
However, .data is not like the dot (.). The dot is the data set itself, but .data is just a pronoun from which I can pull variables with .data[[varname]].
info2 <- function(var,df = get(".data",envir = parent.frame())) {
paste(mean(var),nrow(.data),sep=', ')
}
#Doesn't work. nrow(.data) gives blank strings
dat %>% group_by(i) %>% mutate(types = info2(a))
How can I get the full thing from .data? I know I didn't include it in the example but specifically I both need some stuff from attr(dat) AND some stuff from the variables in dat that is properly subsetted for the grouping, so neither reverting to . nor just pulling out variables and getting stuff from there would work.
As Alexis mentioned in the above comment, this is not possible, as it's not the intended use of .data. However, now that I've given up on doing this directly, I've worked up a kludge using a combination of . and .data.
info <- function(var,df = get(".",envir = parent.frame())) {
#First, get any information you need from .
fulldatasize <- nrow(df)
#Then, check if you actually need .data,
#i.e. the data is grouped and you need a subsample
if (length(var) < nrow(df)) {
#If you are, get the list of variables you want from .data, maybe all of them
namesiwant <- names(df)
#Get .data
datapronoun <- get('.data',envir=parent.frame())
#And remake df using just the subsample
df <- data.frame(lapply(namesiwant, function(x) datapronoun[[x]]))
names(df) <- namesiwant
}
#Now do whatever you want with the .data data
groupsize <- nrow(df)
paste(mean(var),groupsize,fulldatasize,sep=', ')
}
dat <- data.frame(a = 1:10, i = c(rep(1,5),rep(2,5)))
#types contains the within-group mean, then 5, then 10
dat %>% group_by(i) %>% mutate(types = info(a))
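If only sizes are needed, one possible way to avoid reaching into the parent frame is to pass the values in explicitly. This is a sketch of an alternative, not the approach above: it assumes the function signature can change, and the argument names group_size and full_size are invented for the example. dplyr's n() gives the current group's row count inside mutate.

```r
library(dplyr)

dat <- data.frame(a = 1:10, i = c(rep(1, 5), rep(2, 5)))

# group_size and full_size are hypothetical argument names for this sketch
info <- function(var, group_size, full_size) {
  paste(mean(var), group_size, full_size, sep = ", ")
}

# Take whole-data information before grouping
full_n <- nrow(dat)

out <- dat %>%
  group_by(i) %>%
  mutate(types = info(a, n(), full_n))  # n() is the size of the current group
```

This does not cover the case where attributes of the full data set are needed inside the function, but those could be extracted before the pipeline and passed the same way as full_n.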
Why not use length() instead of nrow() here ?
dat <- data.frame(a = 1:10, i = c(rep(1,5),rep(2,5)))
info <- function(var) {
paste(mean(var),length(var),sep=', ')
}
dat %>% group_by(i) %>% mutate(types = info(a))
#> # A tibble: 10 x 3
#> # Groups: i [2]
#> a i types
#> <int> <dbl> <chr>
#> 1 1 1 3, 5
#> 2 2 1 3, 5
#> 3 3 1 3, 5
#> 4 4 1 3, 5
#> 5 5 1 3, 5
#> 6 6 2 8, 5
#> 7 7 2 8, 5
#> 8 8 2 8, 5
#> 9 9 2 8, 5
#> 10 10 2 8, 5
I am trying to run rcorr as part of a function over multiple dataframes, extracting p-values for each test, but I am receiving NA values when piping into rcorr.
For example if I create a matrix and run rcorr on this matrix, extracting the pvalue table with $P and the pvalue with [2] it works...
library(Hmisc)
library(magrittr)
mt <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), ncol=2)
rcorr(mt, type="pearson")$P[2]
[1] 0
But if I try to pipe this I only receive NAs.
mt %>% rcorr(., type="pearson")$P[2]
[1] NA NA
mt %>% rcorr(., type="pearson")$P
Error in .$rcorr(., type = "pearson") :
3 arguments passed to '$' which requires 2
Can someone explain to me why this doesn't work or give a workaround? Ideally I don't want to have to create variables for each of my matrices before running rcorr.
Thanks in advance.
Solution
(mt %>% rcorr(type = "pearson"))$P[2]
# [1] 0
Explanation
Notice that both
mt %>% rcorr(., type = "pearson")
and
mt %>% rcorr(type = "pearson")
work as expected. The problem is that you add $ and [ to the second object, which basically are like subsequent function calls. For instance,
s <- function(x) c(1, 1 + x)
1 %>% s
# [1] 1 2
works as expected, but
1 %>% s[1]
# Error in .[s, 1] : incorrect number of dimensions
doesn't return 1, since the pipe treats s[1] as a call to [ and inserts the left-hand side as its first argument, so we end up evaluating .[s, 1] rather than s(1)[1].
Now
1 %>% s(x = .)[1]
# Error in .[s(x = .), 1] : incorrect number of dimensions
just as yours
mt %>% rcorr(., type = "pearson")$P[2]
# [1] NA NA
is trickier. Notice that it can be rewritten as
mt %>% `[`(`$`(rcorr(., type = "pearson"), "P"), 2)
# [1] NA NA
So, now it becomes clear that the latter doesn't work because it basically is
`[`(mt, `$`(rcorr(mt, type = "pearson"), "P"), 2)
# [1] NA NA
which, when deciphered, is
mt[rcorr(mt, type = "pearson")$P, 2]
# [1] NA NA
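Two other ways to keep everything in one pipeline can be sketched as follows. To keep the example runnable without Hmisc, fake_corr is a made-up stand-in that returns a list with a P element shaped like the one rcorr returns; with Hmisc loaded, rcorr(., type = "pearson") would take its place. Wrapping the right-hand side in braces stops magrittr from inserting the left-hand side as the first argument, and extract2() and extract() are magrittr's aliases for [[ and [.

```r
library(magrittr)

# Hypothetical stand-in for rcorr(): returns a list with a 2 x 2 P matrix
fake_corr <- function(m) {
  list(r = cor(m), P = matrix(c(NA, 0, 0, NA), nrow = 2))
}

mt <- matrix(1:10, ncol = 2)

# Braces: the dot is only used where it is written explicitly
p_braces <- mt %>% { fake_corr(.)$P[2] }

# Step-by-step extraction with magrittr's aliases for `[[` and `[`
p_chain <- mt %>% fake_corr() %>% extract2("P") %>% extract(2)
```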
A tidy solution, at least I hope!
library(dplyr)
library(broom)
library(Hmisc)
mtcars[, 5:6] %>%
as.matrix()%>%
rcorr()%>%
tidy() %>%
select(estimate)
A simple solution using %$% from magrittr:
library(Hmisc)
library(magrittr)
mt <- matrix(1:10, ncol=2)
mt %>% rcorr(type="pearson") %$% P[2]
[1] 0
My data frame looks like this:
df <- tibble(x = c(1, 2, NA),
y = c(1, NA, 3),
z = c(NA, 2, 3))
I want to replace NA with 0 using tidyr::replace_na(). As this function's documentation makes clear, it's straightforward to do this once you know which columns you want to perform the operation on.
df <- df %>% replace_na(list(x = 0, y = 0, z = 0))
But what if you have an indeterminate number of columns? (I say 'indeterminate' because I'm trying to create a function that does this on the fly using dplyr tools.) If I'm not mistaken, the base R equivalent to what I'm trying to achieve using the aforementioned tools is:
df[, 1:ncol(df)][is.na(df[, 1:ncol(df)])] <- 0
But I always struggle to get my head around this code. Thanks in advance for your help.
We can do this by creating a list of 0's based on the number of columns of dataset and set the names with the column names
library(tidyverse)
df %>%
replace_na(set_names(as.list(rep(0, length(.))), names(.)))
# A tibble: 3 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 2 0 2
#3 0 3 3
Another option is mutate_all (or mutate_at for selected columns, or mutate_if for columns chosen by a condition), applying replace_na:
df %>%
mutate_all(replace_na, replace = 0)
With base R, it is more straightforward
df[is.na(df)] <- 0
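For newer dplyr versions, the same idea can be sketched with across(), which takes over the role of mutate_all. This assumes dplyr 1.0.0 or later and that every selected column is numeric; a character column would need a replacement of matching type.

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c(1, 2, NA),
             y = c(1, NA, 3),
             z = c(NA, 2, 3))

# Apply replace_na() to every column, whatever the number of columns
out <- df %>%
  mutate(across(everything(), ~ replace_na(.x, 0)))
```

Swapping everything() for a narrower selection, such as a column range, restricts the replacement to those columns.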
Using dplyr, I am trying to conditionally update values in a column using ifelse and mutate. I am trying to say that, in a data frame, if any variable (column) in a row is equal to 7, then variable c should become 100, otherwise c remains the same.
df <- data.frame(a = c(1,2,3),
b = c(1,7,3),
c = c(5,2,9))
df <- df %>% mutate(c = ifelse(any(vars(everything()) == 7), 100, c))
This gives me the error:
Error in mutate_impl(.data, dots) :
Evaluation error: (list) object cannot be coerced to type 'double'.
The output I'd like is:
a b c
1 1 1 5
2 2 7 100
3 3 3 9
Note: this is an abstract example of a larger data set with more rows and columns.
EDIT:
This code gets me a bit closer, but it does not apply the ifelse statement by each row. Instead, it is changing all values to 100 in column c if 7 is present anywhere in the data frame.
df <- df %>% mutate(c = ifelse(any(select(., everything()) == 7), 100, c))
a b c
1 1 1 100
2 2 7 100
3 3 3 100
Perhaps this is not possible to do using dplyr?
I think this should work. We can check which values in df are equal to 7, then use rowSums to see whether any row has a count larger than 0, which means at least one value in that row is 7.
df <- df %>% mutate(c = ifelse(rowSums(df == 7) > 0, 100, c))
Or we can use apply
df <- df %>% mutate(c = ifelse(apply(df == 7, 1, any), 100, c))
A base R equivalent is like this.
df$c[apply(df == 7, 1, any)] <- 100
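The row-wise check can also be sketched with dplyr's if_any(), which keeps the whole expression inside mutate without referring back to df by name. This assumes dplyr 1.0.4 or later, where if_any() was introduced.

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 3),
                 b = c(1, 7, 3),
                 c = c(5, 2, 9))

# if_any() returns one TRUE/FALSE per row: TRUE if any selected column equals 7
out <- df %>%
  mutate(c = ifelse(if_any(everything(), ~ .x == 7), 100, c))
```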
You could try with purrr::map_dbl
library(purrr)
df$c <- map_dbl(1:nrow(df), ~ifelse(any(df[.x,]==7), 100, df[.x,]$c))
Output
a b c
1 1 1 5
2 2 7 100
3 3 3 9
In a dplyr::mutate statement this would be
library(purrr)
library(dplyr)
df %>%
mutate(c = map_dbl(1:nrow(df), ~ifelse(any(df[.x,]==7), 100, df[.x,]$c)))
I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group
group var total
1 1 1.3697267 4.74936
2 1 1.5263502 4.74936
3 1 0.4065596 4.74936
4 1 1.4467237 4.74936
5 2 NA 0.00000
6 2 NA 0.00000
7 2 NA 0.00000
8 2 NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
if( all(x$total == 0) ) { x$var <- df[df$group==1, 2] } ; x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First, make sure df is a plain data.frame. Second, your function CheckDay(x) uses both its local argument x and the global variable df; it's better to keep everything inside the function local. Finally, your call to do() should be do(df, CheckDay(.)); yours is missing the (.) part. Try this, which should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
Generate df as previously.
Create df.shift, a copy of df with groups 1, 1, 2... etc -- i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
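The steps above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the rows are sorted by group, every group has the same number of rows (group_size, a name invented here), and a third group is added to the synthetic data so the shift has something to skip.

```r
set.seed(1)

# Synthetic data: group 2 is entirely missing
df <- data.frame(group = rep(1:3, each = 4),
                 var   = c(abs(rnorm(4)), rep(NA, 4), abs(rnorm(4))))

# Per-group total, repeated on every row of the group
df$total <- ave(df$var, df$group, FUN = function(v) sum(v, na.rm = TRUE))

# df.shift: same layout, but var is moved down by one group
# (assumes equal-sized groups in sorted order)
group_size <- 4
df.shift <- df
df.shift$var <- c(rep(NA, group_size), head(df$var, -group_size))

# Copy the shifted values into the rows whose group total is zero
idx <- which(df$total == 0)
df$var[idx] <- df.shift$var[idx]
```

With two or more consecutive empty groups, a single shift would only fill the first of them, so the shift-and-copy step would need to be repeated.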