I'm looking for a way to recode huge sets of variables to NA.
This should work like:
if (df$check1==1) then df$Q1,Q2,Q3....Q100 <-NA
if (df$check2==1) then df$R1,R2,R3....R500 <-NA
I'd like to keep the lists of variable names to change in separate CSV files.
I thought about ifelse or recode, but I'm not sure how to apply them to whole sets of variables on the output side. In mutate_if the condition applies to the target variables themselves, so I got lost.
We could use mutate with across
library(dplyr)
df <- df %>%
mutate(across(Q1:Q100, ~ replace(.x, check1 == 1, NA)),
across(R1:R500, ~ replace(.x, check2 == 1, NA)))
In base R, we may use
qcols <- paste0("Q", 1:100)
rcols <- paste0("R", 1:500)
df[df$check1 == 1, qcols] <- NA
df[df$check2 == 1, rcols] <- NA
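Since the question mentions keeping the variable names in separate CSV files, here is a minimal sketch of wiring that in; the file names q_vars.csv/r_vars.csv and the one-column, headerless layout are assumptions:
library(dplyr)
# assumed layout: one column of variable names, no header row
qcols <- read.csv("q_vars.csv", header = FALSE)[[1]]
rcols <- read.csv("r_vars.csv", header = FALSE)[[1]]
df <- df %>%
  mutate(across(all_of(qcols), ~ replace(.x, check1 == 1, NA)),
         across(all_of(rcols), ~ replace(.x, check2 == 1, NA)))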
I have 2 scenarios:
One where I would like to define a new variable (called df$x1) depending on whether there are 16 NAs in 16 other different columns. My proposed code would be:
cols <- 1:16
df %>% mutate(x1 = ifelse(rowSums(df[cols] == NA, na.rm = TRUE) == 16, 'Yes', 'No'))
In the second scenario, I would like to check whether there is at least 1 NA in a list of 12 variables.
How would you do that?
Thank you!
Continuing with your first approach, except that NAs are checked with is.na():
cols <- 1:12
df$x1 <- ifelse(rowSums(is.na(df[cols])) > 0, 'Yes', 'No')
For the first scenario: df$x1 <- ifelse(rowSums(is.na(df[, cols])) == 16, "Yes", "No")
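As a quick check, here is a small reproducible sketch of both scenarios (the toy data and the two-column selection are made up):
df <- data.frame(a = c(NA, 1, NA), b = c(NA, 2, 3))
cols <- 1:2
# scenario 1: all selected columns are NA
df$x1 <- ifelse(rowSums(is.na(df[cols])) == length(cols), 'Yes', 'No')
# scenario 2: at least one selected column is NA
df$x2 <- ifelse(rowSums(is.na(df[cols])) > 0, 'Yes', 'No')
df
#    a  b  x1  x2
# 1 NA NA Yes Yes
# 2  1  2  No  No
# 3 NA  3  No Yes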
I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.
This is what I tried:
# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")
df <-data.frame(age, sex)
# create the sample conditions
conditions <- list(
list("age", c(18:100)),
list("sex", c("f", "m"))
)
addIndicator <- function (df, columnName, validValues) {
indicator <- vector()
for (row in df[, toString(columnName)]) {
# for some strange reason, %in% doesn't work correctly here, but always returns FALSE
indicator <- append(indicator, row %in% validValues)
}
df <- cbind(df, indicator)
# rename the column
names(df)[length(names(df))] <- paste0("I_", columnName)
return(df)
}
for (condition in conditions){
columnName <- condition[1]
validValues <- condition[2]
df <- addIndicator(df, columnName, validValues)
}
print(df)
However, this leads to all conditions being considered not met, which is not what I expect:
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f FALSE FALSE
I figured that %in% does not return the expected result. I checked typeof(row) and tried to boil this down to a minimal example. In a simple minimal example with the same types and values for the variables, %in% works properly. So something must be wrong within the context in which I apply it. Since this is my first attempt to write anything in R, I am stuck here.
What am I doing wrong and how can I achieve what I want?
If you prefer an approach that uses the tidyverse family of packages:
library(tidyverse)
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
imap_dfc(~ .x %in% allowed_values[[.y]]) %>%
rename_with(~ paste0('I_', .x)) %>%
bind_cols(df)
imap_dfc applies a lambda function to each column in df and binds the results back together as columns. .x references the column content and .y references the column name.
rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.
I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.
conditions appears to be a nested list. When you use:
validValues <- condition[2]
in your for loop, your result is also a list.
To get the vector of values to use with %in%, you need to extract with [[:
validValues <- condition[[2]]
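A quick illustration of the difference, using one made-up condition element:
condition <- list("age", 18:100)
class(condition[2])     # "list"    -- [ keeps the wrapper
class(condition[[2]])   # "integer" -- [[ extracts the vector
45 %in% condition[2]    # FALSE: compared against the list element as a whole
45 %in% condition[[2]]  # TRUE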
A simplified approach to obtaining indicators could be with a simple list:
conditions_lst <- list(age = 18:100, sex = c("f", "m"))
And using sapply instead of a for loop:
cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
df[[x]] %in% conditions_lst[[x]]
}))
Output
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f TRUE TRUE
An alternative approach using across and cur_column() (and leaning heavily on severin's solution):
library(tidyverse)
df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
mutate(across(c(age, sex),
c(valid = ~ .x %in% allowed_values[[cur_column()]])
)
)
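With the toy data above, this should produce logical age_valid and sex_valid columns alongside the originals (across names them {column}_{function} by default):
# A tibble: 2 x 4
#     age sex   age_valid sex_valid
#   <dbl> <chr> <lgl>     <lgl>
# 1    12 f     FALSE     TRUE
# 2    45 f     TRUE      TRUE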
Reference: https://dplyr.tidyverse.org/articles/colwise.html#current-column
Related question: Referring to column names inside dplyr's across()
I suspect that this will be a duplicate, but my efforts to find an answer have failed. Suppose that I have a data frame with columns made entirely of either integers or factors. Some of these columns have factors with many levels and some do not. Suppose that I want to select parts of or otherwise subset the data such that I only get the columns with factors that have less than 10 levels. How can I do this? My first thought was to make a particularly nasty sapply command, but I'm hoping for a better way.
We can use select_if
library(dplyr)
df1 %>%
select_if(~ is.factor(.) && nlevels(.) < 10)
With a reproducible example using iris
data(iris)
iris %>%
select_if(~ is.factor(.) && nlevels(.) < 10)
Or using sapply
i1 <- sapply(df1, function(x) is.factor(x) && nlevels(x) < 10)
df1[i1]
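Note that select_if has since been superseded; with dplyr >= 1.0 the same selection can be written with where():
library(dplyr)
df1 %>%
  select(where(~ is.factor(.x) && nlevels(.x) < 10))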
With data.table you can do:
library(data.table)
setDT(df)
df[, .SD, .SDcols = sapply(df, function(x) is.factor(x) && nlevels(x) < 10)]
Example (here keeping columns with more than 5 levels, so only y remains):
df <- data.table(x = factor(1:3, levels = 1:5), y = factor(1:3, levels = 1:10))
df[,.SD, .SDcols = sapply(df, function(x) length(levels(x))>5)]
y
1: 1
2: 2
3: 3
I'd like to remove rows corresponding to a particular combination of variables from my data frame.
Here's some dummy data:
father<- c(1, 1, 1, 1, 1)
mother<- c(1, 1, 1, NA, NA)
children <- c(NA, NA, 2, 5, 2)
cousins <- c(NA, 5, 1, 1, 4)
dataset <- data.frame(father, mother, children, cousins)
dataset
  father mother children cousins
1      1      1       NA      NA
2      1      1       NA       5
3      1      1        2       1
4      1     NA        5       1
5      1     NA        2       4
I want to keep only this row:
  father mother children cousins
1      1      1       NA      NA
I can do it with :
test <- dataset %>%
filter(father==1 & mother==1) %>%
filter (is.na(children)) %>%
filter (is.na(cousins))
test
My question:
I have many columns like grandfather, uncle1, uncle2, uncle3, and I want to avoid something like this:
filter (is.na(children)) %>%
filter (is.na(cousins)) %>%
filter (is.na(uncle1)) %>%
filter (is.na(uncle2)) %>%
filter (is.na(uncle3))
and so on...
How can I use dplyr to say: keep rows where father == 1 & mother == 1 and all the other columns are NA?
A possible dplyr(0.5.0.9004 <= version < 1.0) solution is:
# > packageVersion('dplyr')
# [1] ‘0.5.0.9004’
dataset %>%
filter(!is.na(father), !is.na(mother)) %>%
filter_at(vars(-father, -mother), all_vars(is.na(.)))
Explanation:
vars(-father, -mother): select all columns except father and mother.
all_vars(is.na(.)): keep rows where is.na is TRUE for all the selected columns.
note: any_vars should be used instead of all_vars if rows where is.na is TRUE for any column are to be kept.
Update (2020-11-28)
As the _at functions and vars have been superseded by the use of across since dplyr 1.0, the following way (or similar) is recommended now:
dataset %>%
filter(across(c(father, mother), ~ !is.na(.x))) %>%
filter(across(c(-father, -mother), is.na))
See more examples of across and how to rewrite previous code with the new approach here: Column-wise operations, or type vignette("colwise") in R after installing the latest version of dplyr.
dplyr >= 1.0.4
If you're using dplyr version >= 1.0.4, you really should use if_any or if_all, which combine the results of the predicate function into a single logical vector, making them very useful in filter. The syntax is identical to across, but these verbs were added specifically to fill this need.
library(dplyr)
dataset %>%
filter(if_all(-c(father, mother), ~ is.na(.)), if_all(c(father, mother), ~ !is.na(.)))
Here I have written out the variable names, but you can use any tidy selection helper to specify variables (e.g., column ranges by name or location, regular expression matching, substring matching, starts with/ends with, etc.).
Output
father mother children cousins
1 1 1 NA NA
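For instance, assuming the data actually contained uncle1, uncle2, uncle3, ... columns (hypothetical here), a tidy selection helper keeps the call short:
dataset %>%
  filter(if_all(starts_with("uncle"), is.na),
         if_all(c(father, mother), ~ !is.na(.)))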
None of the answers seems adaptable. I think the intention is to avoid listing all the variables and values to filter the data by.
One easy way to achieve this is through merging. If you have all the conditions in df_filter then you can do this:
df_results = df_filter %>% left_join(df_all)
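This works because left_join joins on all shared columns by default and treats NA as equal to NA (the na_matches = "na" default). A semi_join does the same row filtering without adding columns; a minimal sketch, assuming df_all holds the full data and df_filter the combinations to keep:
library(dplyr)
df_results <- df_all %>% semi_join(df_filter)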
A dplyr solution:
test <- dataset %>%
filter(father==1 & mother==1 & rowSums(is.na(.[,3:4]))==2)
Where '2' is the number of columns that should be NA.
This gives:
> test
father mother children cousins
1 1 1 NA NA
You can apply this logic in base R as well:
dataset[dataset$father==1 & dataset$mother==1 & rowSums(is.na(dataset[,3:4]))==2,]
Here is a base R method using two Reduce functions and [ to subset.
keepers <- Reduce(function(x, y) x == 1 & y == 1, dataset[, 1:2]) &
Reduce(function(x, y) is.na(x) & is.na(y), dataset[, 3:4])
keepers
[1] TRUE FALSE FALSE FALSE FALSE
Each Reduce consecutively takes the variables provided and performs a logical check. The two results are connected with an &. The second argument to the Reduce functions can be adjusted to include whatever variables in the data.frame that you want.
Then use the logical vector to subset
dataset[keepers,]
father mother children cousins
1 1 1 NA NA
This answer builds on @Feng Jiang's answer using the dplyr::left_join() operation, and is more of a reprex. In addition, it ensures the proper order of columns is restored in case the order of variables in df_filter differs from their order in the original dataset. Also, the dataset was expanded with a duplicate combination to show that such rows are part of the filtered output (test).
library(dplyr)
father<- c(1, 1, 1, 1, 1,1)
mother<- c(1, 1, 1, NA, NA,1)
children <- c(NA, NA, 2, 5, 2,NA)
cousins <- c(NA, 5, 1, 1, 4,NA)
dataset <- data.frame(father, mother, children, cousins)
df_filter <- data.frame( father = 1, mother = 1, children = NA, cousins = NA)
test <- df_filter %>%
left_join(dataset) %>%
relocate(colnames(dataset))
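Given the default NA-matching of left_join (na_matches = "na"), test should contain both matching rows:
test
#   father mother children cousins
# 1      1      1       NA      NA
# 2      1      1       NA      NA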
I am writing a function which needs to check whether (and which!) columns (variables) have all values missing (NA, <NA>). The following is a fragment of the function:
test1 <- data.frame (matrix(c(1,2,3,NA,2,3,NA,NA,2), 3,3))
test2 <- data.frame (matrix(c(1,2,3,NA,NA,NA,NA,NA,2), 3,3))
na.test <- function (data) {
if (colSums(!is.na(data) == 0)){
stop ("The some variable in the dataset has all missing value,
remove the column to proceed")
}
}
na.test (test1)
Warning message:
In if (colSums(!is.na(data) == 0)) { :
the condition has length > 1 and only the first element will be used
Q1: Why does the above give this warning, and how can it be fixed?
Q2: Is there any way to find which of columns have all NA, for example output the list (name of variable or column number)?
This is easy enough to do with sapply and a small anonymous function:
sapply(test1, function(x)all(is.na(x)))
X1 X2 X3
FALSE FALSE FALSE
sapply(test2, function(x)all(is.na(x)))
X1 X2 X3
FALSE TRUE FALSE
And inside a function:
na.test <- function (x) {
w <- sapply(x, function(x)all(is.na(x)))
if (any(w)) {
stop(paste("All NA in columns", paste(which(w), collapse=", ")))
}
}
na.test(test1)
na.test(test2)
Error in na.test(test2) : All NA in columns 2
In dplyr
ColNums_NotAllMissing <- function(df){ # helper function
as.vector(which(colSums(is.na(df)) != nrow(df)))
}
df %>%
select(ColNums_NotAllMissing(.))
example:
x <- data.frame(x = c(NA, NA, NA), y = c(1, 2, NA), z = c(5, 6, 7))
x %>%
select(ColNums_NotAllMissing(.))
or, the other way around
Cols_AllMissing <- function(df){ # helper function
as.vector(which(colSums(is.na(df)) == nrow(df)))
}
x %>%
select(-Cols_AllMissing(.))
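With dplyr >= 1.0 the helper functions can be replaced by where(), which expresses the same logic inline:
library(dplyr)
x %>% select(where(~ !all(is.na(.x))))  # keep columns that are not entirely NA
x %>% select(!where(~ all(is.na(.x))))  # or drop the all-NA columns explicitly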
To find the columns with all values missing
allmisscols <- apply(dataset, 2, function(x) all(is.na(x)))
colswithallmiss <- names(allmisscols[allmisscols > 0])
print("the columns with all values missing")
print(colswithallmiss)
This one will generate the column names that are full of NAs:
library(purrr)
df %>% keep(~all(is.na(.x))) %>% names
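The purrr counterpart discard drops those columns instead of just naming them:
library(purrr)
df %>% discard(~ all(is.na(.x)))  # returns df without the all-NA columns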
To test whether columns have all missing values:
apply(test1,2,function(x) {all(is.na(x))})
To keep only the columns with no missing values:
test1.nona <- test1[ , colSums(is.na(test1)) == 0]
dplyr approach to finding the number of NAs for each column (funs() is defunct in current dplyr, so a lambda is used):
df %>%
summarise_all(~ sum(is.na(.)))
The following command gives you a named logical vector showing which columns contain any NA values:
sapply(dataframe, function(x) any(is.na(x)))
It's an improvement on the first answer you got, which doesn't work properly in some cases.
sapply(b, function(X) sum(is.na(X)))
This gives you the count of NAs in each column of the dataset, returning 0 for columns with no NAs.
Variant dplyr approach:
dataframe %>% select_if(function(x) all(is.na(x))) %>% colnames()