I have a dataset with two columns, one of which contains missing values.
I load it using
data <- read_excel("file.xlsx") %>%
select("ID", "Value")
The tibble looks like this:
ID    Value
1     2
NA    4
32    1
The NAs are recognized as such.
However, I use
data["ID"=="NA"] <- NA
to ensure that this is not the problem (R: is.na() does not pick up NA value).
When I try to filter:
data %>%
filter(!is.na(ID))
the whole tibble stays the same, and no row is deleted.
So I try
data %>%
  mutate(
    isna = is.na(ID)
  )
and all isna are FALSE.
Why doesn't dplyr recognize the NAs?
I am grateful for any help!
data["ID"=="NA"] <- NA
does nothing. The condition "ID"=="NA" is always FALSE, since you are comparing two unequal string literals ("ID" and "NA"). To fix it, use e.g.
data[data$ID == "NA", "ID"] <- NA
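To see why the original assignment is a no-op, here is a minimal sketch on a made-up data frame (the values are illustrative, not the asker's file). It assumes ID was read in as text, and it uses which() so that rows where the comparison itself is NA are skipped rather than passed on as an NA index:

```r
# Illustrative data: ID was read as character, so "NA" is an ordinary string
data <- data.frame(ID = c("1", "NA", "32"), Value = c(2, 4, 1))

# Two different string literals are compared: a single FALSE, whatever is in data
literal_comparison <- "ID" == "NA"

# Comparing the column itself gives one TRUE/FALSE per row
is_na_string <- data$ID == "NA"

# which() keeps only the positions that are TRUE
data[which(is_na_string), "ID"] <- NA

is.na(data$ID)
```

After this, is.na() and filter(!is.na(ID)) see the second row as missing.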
Welcome to SO! Use this to turn the "NA" strings into real NAs and then drop those rows:
data <- data %>%
  mutate(ID = ifelse(ID == "NA", NA, ID)) %>%
  filter(!is.na(ID))
Why not directly
data %>%
filter(ID != "NA")
or
subset(data, ID != "NA")
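One thing worth knowing about the direct filter: a comparison with a real NA yields NA, and filter() drops rows whose condition is NA, so both the "NA" strings and any genuine missing values disappear. A small sketch on made-up data (the tibble below is an assumption for illustration), with na_if() shown as a way to convert the strings without dropping anything:

```r
library(dplyr)

# Illustrative data containing both the string "NA" and a real NA
data <- tibble(ID = c("1", "NA", NA, "32"), Value = c(2, 4, 7, 1))

# Rows where the condition is FALSE or NA are both removed
kept <- data %>% filter(ID != "NA")

# na_if() turns the "NA" strings into real NAs and keeps every row
cleaned <- data %>% mutate(ID = na_if(ID, "NA"))

kept$ID
sum(is.na(cleaned$ID))
```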
Related
I am trying to create a column and set it to 1 when all of a particular set of columns (sharing a similar name pattern) are NA.
This is what I have tried so far, and it doesn't seem to work.
Any help would be appreciated, thanks!
mutate(
column_to_create =
case_when(
is.na(vars(matches('pattern'))) ~ as.character(1)
)
)
You can try -
library(dplyr)
df <- df %>%
  mutate(column_to_create = as.integer(
    rowSums(!is.na(select(., matches('pattern')))) == 0))
This should give 1 when, in a given row, all the columns whose names match 'pattern' are NA, and 0 otherwise.
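A worked example of the same expression on a tiny data frame may help. The column names pattern_a, pattern_b and other are hypothetical, chosen only so that two columns match 'pattern':

```r
library(dplyr)

# Hypothetical data: two columns match "pattern", one does not
df <- data.frame(
  pattern_a = c(NA, 1, NA),
  pattern_b = c(NA, NA, 2),
  other     = c(5, 6, 7)
)

df <- df %>%
  mutate(column_to_create = as.integer(
    # count the non-missing values per row among the matching columns
    rowSums(!is.na(select(., matches("pattern")))) == 0
  ))

df$column_to_create
```

Only the first row has NA in every matching column, so only that row is flagged.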
I'd like to assign NAs to columns based on their name and another column value.
Like in the following example:
Given dataframe iris, I would like to assign NA to all columns whose name starts with "Sepal" and column "Species" == "setosa"
A solution using dplyr mutate_at/mutate_if is preferable, any other solution is also welcome.
I tried
iris %>%
mutate_if(str_detect(names(.), pattern = "Sepal") & (.$Species == "setosa") , function(x){x <- NA})
Error in tbl_if_vars(.tbl, .p, .env, ..., .include_group_vars = .include_group_vars) :
length(.p) == length(tibble_vars) is not TRUE
In dplyr, select vars that contain "Sepal" and assign NA to those rows where Species is "setosa":
iris %>%
  mutate_at(vars(contains("Sepal")), ~ ifelse(Species == "setosa", NA, .))
Or, using replace() instead of ifelse():
iris %>%
  mutate_at(vars(contains("Sepal")),
            ~ replace(., Species == "setosa", NA))
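With dplyr 1.0.0 or later (an assumption about the installed version), the same row-conditional replacement can be sketched with across(). Note that na_if(Species, "setosa") would return the Species column with "setosa" blanked out rather than the Sepal values, so this sketch uses if_else() on the column itself:

```r
library(dplyr)

result <- iris %>%
  mutate(across(
    starts_with("Sepal"),
    # .x is the current Sepal column; Species is looked up in the same data
    ~ if_else(Species == "setosa", NA_real_, .x)
  ))

head(result)
```

The Petal columns and Species are left untouched, and only the setosa rows lose their Sepal values.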
Here I'm attempting to remove NA values from a tibble:
mc = as_tibble(c("NA" , NA , "ls", "test"))
mc <- filter(mc , is.na == TRUE)
But an error is returned:
> mc = as_tibble(c("NA" , NA , "ls", "test"))
> mc <- filter(mc , is.na == TRUE)
Error in filter_impl(.data, quo) :
Evaluation error: comparison (1) is possible only for atomic and list types.
How can I remove NA values from this tibble?
Try:
library(tidyverse)
mc %>%
mutate(value = replace(value, value == "NA", NA)) %>%
drop_na()
Which gives:
# A tibble: 2 x 1
  value
  <chr>
1 ls
2 test
The second line replaces every "NA" string with a real <NA>. The third line then drops all rows containing <NA>.
If you simply want to remove actual NA values:
library(dplyr)
filter(mc, !is.na(value))
Alternatively (this will check all columns, not just the specified column as above):
na.omit(mc)
If you want to remove both NA values, and values equaling the string "NA":
library(dplyr)
filter(mc, !is.na(value), !value == "NA")
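The difference between the two filters is easiest to see side by side. This sketch rebuilds the tibble with tibble(value = ...) instead of as_tibble() on a bare vector (an assumption made to get a predictable column name). The original error arose because is.na == TRUE compares the function is.na itself with TRUE, instead of calling it on a column:

```r
library(dplyr)

mc <- tibble(value = c("NA", NA, "ls", "test"))

# Removes only the real missing value; the string "NA" survives
only_real_na_removed <- filter(mc, !is.na(value))

# Removes the real missing value and the string "NA"
both_removed <- filter(mc, !is.na(value), value != "NA")

only_real_na_removed$value
both_removed$value
```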
The solutions given by @tyluRp and @danh work perfectly fine.
Just wanted to add another alternative solution with two advantages:
- simpler code
- shorter (for the lazy ones like me :)
See this one-liner:
mc %>% replace(. == "NA", NA) %>% na.omit
Let's say we have a matrix with 3 columns and 100 rows. Let the column names be a_dem, b_dem and c_blah. And let's imagine that each cell can have a value between 0 and 100.
Is there a way to use select(), filter() and %>% to select only the observations that end with "_dem" and have a value larger than, say, 50?
I would've kinda imagined that it would be along these lines:
dat %>%
select(ends_with("dem")) %>%
filter(>50) %>%
summary()
but that doesn't work, obviously.
So, is there a way to do this kind of selection and filtering, or would I have to resort to something more complicated?
You could do this:
library(dplyr)
set.seed(2)
a_dem <- runif(100,0,100)
b_dem <- runif(100,0,100)
c_blah <- runif(100,0,100)
dat <- data.frame(a_dem, b_dem, c_blah)
newdat1 <- dat %>%
select(ends_with("_dem"))
filtered <- sapply(newdat1, function(x) ifelse(x>50, x, NA))
> head(filtered)
        a_dem    b_dem
[1,]       NA       NA
[2,] 70.23740       NA
[3,] 57.33263 98.06000
[4,]       NA 82.89221
[5,] 94.38393       NA
[6,] 94.34750 59.59169
And then depending on what you want to do next you could easily just exclude the NA values.
Update:
To do this completely in dplyr you can use the method that was linked to here by @sgp667:
newdat2 <- dat %>%
select(ends_with("_dem")) %>%
mutate_each(funs(((function(x){ifelse(x>50, x, NA)})(.))))
> head(newdat2)
     a_dem    b_dem
1       NA       NA
2 70.23740       NA
3 57.33263 98.06000
4       NA 82.89221
5 94.38393       NA
6 94.34750 59.59169
I thought of another tidyverse solution:
dat %>%
select(ends_with("_dem")) %>%
map_df(function(x) ifelse(x > 50, x, NA))
I thought of another way:
dat %>%
mutate_each(funs(over=(function(x)x>2)(.)),ends_with("dem")) %>%
mutate(all_true=all(ends_with("over"))) %>%
filter(all_true == TRUE) %>%
select(ends_with("dem"))
This might be very verbose but you can filter through an arbitrary number of columns.
I found out here how you can use a custom function in mutate_each.
The way this works is that mutate_each applies funs() to every column that matches the ends_with("dem") criterion. The function applied here is (function(x)x>2)(.), an anonymous function (exactly what it sounds like: a function that I didn't bother naming).
The syntax for calling an anonymous function is:
(function(some parameters) some instructions)(values for parameters)
In this case the function returns TRUE if x is greater than 2, and the value passed as x is . (inside funs(), . stands for the column currently being processed).
So the mutate_each line produces additional columns whose names end in "over".
The next line creates another TRUE/FALSE column (called all_true), which is meant to be TRUE when all the columns matched by ends_with("over") are TRUE.
filter simply removes the rows that have FALSE in the all_true column.
Lastly, select keeps just the columns that match ends_with("dem").
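The row-wise intent described above (keep only rows where every column ending in "_dem" exceeds a threshold) can also be sketched with if_all(), assuming dplyr 1.0.4 or later. The threshold of 50 is taken from the question, and the data is regenerated the same way as in the earlier answer:

```r
library(dplyr)

set.seed(2)
dat <- data.frame(
  a_dem  = runif(100, 0, 100),
  b_dem  = runif(100, 0, 100),
  c_blah = runif(100, 0, 100)
)

both_over <- dat %>%
  # if_all() is TRUE for a row only when the test holds in every selected column
  filter(if_all(ends_with("_dem"), ~ .x > 50)) %>%
  select(ends_with("_dem"))

summary(both_over)
```

This avoids the helper columns entirely, and the selection helper can match any number of columns.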
I am trying to replace some filtered values of a data set. So far, I wrote these lines of code:
df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA)),
where uniq is just a list containing the group1 values I want to focus on (and group1 and values are column names). This works, but it only outputs the altered, filtered rows and does not replace anything in the data set df. Does anyone have an idea where my mistake is? Thank you so much! The following code reproduces the example:
group1 <- c("A","A","A","B","B","C")
values <- c(0.6,0.3,0.1,0.2,0.8,0.9)
df <- data.frame(group1, values)
uniq <- unique(unlist(df$group1))
for (i in 1:length(uniq)){
df <- df %>%
filter(group1 == uniq[i]) %>%
mutate(values = ifelse(sum(values) < 1, 2, NA))
}
What I would like is for all values to stay as they are except the last one, since it forms a group of its own (group1 == "C") and 0.9 < 1. So I'd like to get the exact same data frame, except that 0.9 is replaced with NA. Moreover, would it be possible to just use if instead of ifelse?
dplyr verbs do not modify their input; the result is only kept if you store it with an assignment operator (<-).
Compare
require(dplyr)
data(mtcars)
mtcars %>% filter(cyl == 4)
with
mtcars4 <- mtcars %>% filter(cyl == 4)
mtcars4
The data are the same, but in the second example the filtered data is stored in a new object, mtcars4.
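For the group-wise replacement the question asks for, one possible sketch is to group instead of filtering, so that all rows are kept. Two assumptions are labeled here: group2 is left out because it is never defined in the question, and the comparison uses a small tolerance (tol, a hypothetical name) because 0.6 + 0.3 + 0.1 is not guaranteed to equal exactly 1 in floating-point arithmetic:

```r
library(dplyr)

group1 <- c("A", "A", "A", "B", "B", "C")
values <- c(0.6, 0.3, 0.1, 0.2, 0.8, 0.9)
df <- data.frame(group1, values)

tol <- 1e-8  # hypothetical tolerance for the floating-point comparison

result <- df %>%
  group_by(group1) %>%
  # within a group the condition is a single TRUE/FALSE, so plain if works
  mutate(values = if (sum(values) < 1 - tol) NA_real_ else values) %>%
  ungroup()

result
```

Because the sum is a single number per group, if can be used in place of ifelse here; ifelse is only needed when the condition varies row by row.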