I would like to convert NA_character_ to "NA".
Data:
iris_test <- head(iris)
iris_test[c(1,4),c(2,3)] <- NA_real_
iris_test[c(1,2),5] <- NA_character_
iris_test$Species <- as.character(iris_test$Species)
iris_test$NAs <- NA_character_
iris_test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 <NA> <NA>
2 4.9 3.0 1.4 0.2 <NA> <NA>
3 4.7 3.2 1.3 0.2 setosa <NA>
4 4.6 NA NA 0.2 setosa <NA>
5 5.0 3.6 1.4 0.2 setosa <NA>
6 5.4 3.9 1.7 0.4 setosa <NA>
Expected_output:
expected <- iris_test
expected[c(1,2),5] <- "NA"
expected$NAs <- "NA"
expected
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 NA NA
2 4.9 3.0 1.4 0.2 NA NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 NA NA 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
I tried the following but it failed miserably:
iris_test[(sapply(iris_test, class)=="character")&is.na(iris_test)] <- "NA"
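Converting missing values to the string "NA" is generally not recommended, because they then stop being treated as missing. The attempt fails because sapply(iris_test, class) == "character" returns a vector with one element per column, whereas is.na(iris_test) returns a matrix with one element per cell, so the two do not line up. One option is to subset the character columns first, apply is.na to that subset, and then assign: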
i1 <- sapply(iris_test, is.character)
iris_test[i1][is.na(iris_test[i1])] <- "NA"
-output
> str(iris_test)
'data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num NA 3 3.2 NA 3.6 3.9
$ Petal.Length: num NA 1.4 1.3 NA 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : chr "NA" "NA" "setosa" "setosa" ...
$ NAs : chr "NA" "NA" "NA" "NA" ...
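To make the shape mismatch concrete, here is a minimal sketch on a reduced copy of the example data (five columns, without the NAs column). The names is_chr, na_mat, chr_mat and mask are illustrative only. When a vector is combined with a matrix, R recycles the vector down the columns, so the per-column flags end up aligned with rows rather than columns. Expanding the flags into a matrix of the same shape first avoids that:

```r
# Rebuild a small version of the example data (5 columns, no NAs column)
iris_test <- head(iris)
iris_test$Species <- as.character(iris_test$Species)
iris_test[c(1, 2), "Species"] <- NA_character_
iris_test[c(1, 4), c("Sepal.Width", "Petal.Length")] <- NA_real_

is_chr <- sapply(iris_test, is.character)  # one logical per column
na_mat <- is.na(iris_test)                 # one logical per cell

length(is_chr)  # 5
dim(na_mat)     # 6 5

# Repeat the column flags along the rows so both operands are 6 x 5
chr_mat <- matrix(is_chr, nrow = nrow(iris_test), ncol = ncol(iris_test),
                  byrow = TRUE)
mask <- na_mat & chr_mat

# Only NA cells in character columns are replaced; numeric NAs are untouched
iris_test[mask] <- "NA"
```

This is equivalent in effect to the column-subsetting approach above; which one reads better is a matter of taste.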
We could use replace_na together with as.character(). Note that this converts every column, including the numeric ones, to character:
library(dplyr)
library(tidyr)
iris_test %>%
mutate(across(everything(), ~replace_na(as.character(.), "NA")))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 NA NA
2 4.9 3 1.4 0.2 NA NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 NA NA 0.2 setosa NA
5 5 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
I ended up with the following solution:
iris_test %>% mutate_if(is.character, replace_na, "NA")
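Since dplyr 1.0.0, mutate_if() is superseded by across(), so the same idea could be written as below. This is a sketch that assumes dplyr >= 1.0.0 and tidyr are installed; the data frame df and its column names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one numeric and one character column, each with an NA
df <- data.frame(num = c(1.5, NA), chr = c("a", NA), stringsAsFactors = FALSE)

# Replace NA with the string "NA" in character columns only
out <- df %>%
  mutate(across(where(is.character), ~ replace_na(.x, "NA")))
```

The numeric column keeps its real NA, which matches the expected output in the question.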
I would like to add columns in a dataframe (filled in with NA) from an existing vector.
Here's my example (I want 2 new columns "State" and "Type" with NA)
Many thanks in advance !
names_variables <- c("Sepal.Length",
"Sepal.Width",
"Petal.Length",
"Petal.Width",
"Species",
"State",
"Type")
difference <- setdiff(names_variables,names(iris))
# Not working
if (length(difference>0)) {
resultat <- iris %>%
mutate(
for (var_to_add in (difference)) {
!!sym(var_to_add) := NA)
}
)
}
I'm not exactly sure what the result should be, but we can create a named vector of new variables and splice it into mutate with !!!:
library(dplyr)
names_variables <- c("Sepal.Length",
"Sepal.Width",
"Petal.Length",
"Petal.Width",
"Species",
"State",
"Type")
difference <- setdiff(names_variables,names(iris))
new_vars <- setNames(rep(NA, length(difference)), difference)
iris %>%
as_tibble() %>% # for printing
mutate(!!! new_vars)
#> # A tibble: 150 x 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species State Type
#> <dbl> <dbl> <dbl> <dbl> <fct> <lgl> <lgl>
#> 1 5.1 3.5 1.4 0.2 setosa NA NA
#> 2 4.9 3 1.4 0.2 setosa NA NA
#> 3 4.7 3.2 1.3 0.2 setosa NA NA
#> 4 4.6 3.1 1.5 0.2 setosa NA NA
#> 5 5 3.6 1.4 0.2 setosa NA NA
#> 6 5.4 3.9 1.7 0.4 setosa NA NA
#> 7 4.6 3.4 1.4 0.3 setosa NA NA
#> 8 5 3.4 1.5 0.2 setosa NA NA
#> 9 4.4 2.9 1.4 0.2 setosa NA NA
#> 10 4.9 3.1 1.5 0.1 setosa NA NA
#> # … with 140 more rows
Created on 2021-06-09 by the reprex package (v0.3.0)
Alternatively, with base R:
names_variables <- c("Sepal.Length",
"Sepal.Width",
"Petal.Length",
"Petal.Width",
"Species",
"State",
"Type")
difference <- setdiff(names_variables, names(iris))
iris[, difference] <- NA
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species State Type
#> 1 5.1 3.5 1.4 0.2 setosa NA NA
#> 2 4.9 3.0 1.4 0.2 setosa NA NA
#> 3 4.7 3.2 1.3 0.2 setosa NA NA
#> 4 4.6 3.1 1.5 0.2 setosa NA NA
#> 5 5.0 3.6 1.4 0.2 setosa NA NA
#> 6 5.4 3.9 1.7 0.4 setosa NA NA
Created on 2021-06-09 by the reprex package (v0.3.0)
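If this is needed in more than one place, the base R approach could be wrapped in a small helper. This is only a sketch: add_missing_cols and its fill argument are hypothetical names, not part of any package. It also sidesteps the misplaced parenthesis in the question's length(difference>0) by testing the length directly, and it lets you pick a typed NA such as NA_character_ instead of the default logical NA.

```r
# Hypothetical helper: add any columns listed in `wanted` that `df` lacks,
# filled with `fill` (logical NA by default)
add_missing_cols <- function(df, wanted, fill = NA) {
  missing_cols <- setdiff(wanted, names(df))
  if (length(missing_cols) > 0) {
    df[, missing_cols] <- fill
  }
  df
}

names_variables <- c("Sepal.Length", "Sepal.Width", "Petal.Length",
                     "Petal.Width", "Species", "State", "Type")

# Works on a copy, so the built-in iris data is left unchanged
resultat <- add_missing_cols(head(iris), names_variables, fill = NA_character_)
```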
Is there a way to extract only the names of the columns that are factors? For example, in the iris dataset the last column is a factor, so only Species (the column name, not the entire column) should be extracted.
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> str(head(iris))
'data.frame': 6 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
We can use sapply:
names(iris)[sapply(iris, is.factor)]
#[1] "Species"
Or using Filter:
names(Filter(is.factor, iris))
Another solution which involves the dplyr package (if by chance you are already using it in your own project) is
names(iris %>% select_if(is.factor))
or equivalently (choose the one you like more)
iris %>% select_if(is.factor) %>% names()
Output
# [1] "Species"
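In dplyr 1.0.0 and later, select_if() is superseded by select() combined with where(). A sketch of the same lookup, assuming a recent dplyr is installed, with a type-stable base R variant alongside (vapply instead of sapply):

```r
library(dplyr)

# dplyr >= 1.0.0: select by predicate, then take the names
factor_cols <- iris %>% select(where(is.factor)) %>% names()

# Base R variant; vapply() guarantees a logical vector even for 0 columns
factor_cols_base <- names(iris)[vapply(iris, is.factor, logical(1))]
```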
Example
Suppose in the famous iris data set, I have determined that when Sepal.Length > 5.0, there was an error in my measurement device.
In this contrived example, I would like to keep the Sepal.Length column with its original value, but change the remaining columns to NA if the Sepal.Length > 5.0 for that row.
As an example, this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Would become this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA NA
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA NA
I could certainly do this manually via vectorization. Something along the lines of:
iris$Sepal.Width <- ifelse(iris$Sepal.Length > 5.0, NA, iris$Sepal.Width)
In this approach however, I would need to manually specify every column.
Question
I strongly suspect there is a clever way to tackle this via either purrr or dplyr. Nevertheless, I've gotten myself down a pmap / modify_at rabbit hole. Any suggestions towards elegance would be much appreciated.
Thanks!
library(data.table)
dt <- copy(iris)
setDT(dt)
dt[Sepal.Length > 5.0, (which(!names(dt) == "Sepal.Length")) := NA]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 NA NA NA NA
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 NA NA NA NA
# 147: 6.3 NA NA NA NA
# 148: 6.5 NA NA NA NA
# 149: 6.2 NA NA NA NA
# 150: 5.9 NA NA NA NA
An alternative is simply the following (only handy if the columns you want to blank out are all of those from the second one onward):
iris[iris$Sepal.Length > 5.0, 2:ncol(iris)] <- NA
# And the output for first six rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA <NA>
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA <NA>
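If the column to keep is not necessarily the first one, the same base R assignment can select columns by name instead of by position. A minimal sketch, working on a copy of the first six rows; df and keep are illustrative names:

```r
df <- head(iris)          # work on a copy so iris itself is not modified
keep <- "Sepal.Length"    # column that retains its values

# Rows where the kept column exceeds 5, all other columns set to NA
df[df[[keep]] > 5, setdiff(names(df), keep)] <- NA
```

Species stays a factor here, with the affected rows shown as <NA>.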
It sounds like this would work for you
my_clip <- function(x, z) ifelse(z>5, NA, x)
iris %>% mutate_at(vars(-Sepal.Length), my_clip, z=.$Sepal.Length)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA NA
# 2 4.9 3.0 1.4 0.2 1
# 3 4.7 3.2 1.3 0.2 1
# 4 4.6 3.1 1.5 0.2 1
# 5 5.0 3.6 1.4 0.2 1
# 6 5.4 NA NA NA NA
We use mutate_at to select all the columns we want to transform. Because you can't easily reference other columns inside the mutate_at function, we pass the threshold column in as a separate argument using the .$ syntax. Note that ifelse drops the factor attributes, which is why Species shows up as integer codes in the output above.
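With dplyr 1.0.0 or later, across() can refer to other columns directly, and replace() keeps the factor intact. The following is a sketch under that version assumption, not the original answer's method:

```r
library(dplyr)

out <- head(iris) %>%
  # For every column except Sepal.Length, blank the rows where
  # Sepal.Length > 5; replace() preserves the factor class of Species
  mutate(across(-Sepal.Length, ~ replace(.x, Sepal.Length > 5, NA)))
```

Compared with the ifelse version, no helper function or .$ argument is needed, and Species remains a factor.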
Since you asked for a purrr example, here goes. Although I prefer the data.table answer already proposed:
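library(dplyr)  # provides mutate()
library(purrr)
library(tidyr)
# nest(data = ...) and unnest(data) are the tidyr >= 1.0.0 spellings
iris %>% nest(data = -Sepal.Length) %>%
  mutate(data = ifelse(Sepal.Length > 5.0,
                       map(data, function(x) x*NA), data)) %>%
  unnest(data)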
With magrittr you could do this :
library(magrittr)
iris %>% head %>% inset(.$Sepal.Length > 5,-1,NA)
or using base R instead of magrittr (same output, just uglier function :), and you still need magrittr or dplyr for the pipes):
iris %>% head %>% `[<-`(.$Sepal.Length > 5,-1,NA)
-1 is the index of the column you want to keep, negated.
result
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA <NA>
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 NA NA NA <NA>
I am facing the following issues:
I want to replace all NA's of a certain categorical variable with "Unknown", however it does not work.
Here's the code:
x <- "Unknown"
kd$form_of_address[which(is.na(kd$form_of_address))] <- x
The problem arises when I perform
levels(kd$form_of_address)
Sadly, my output does not include "Unknown".
My data includes ebooks whose weight is always 0. Which code is appropriate to replace NAs of the variable weight that have values of the variable ebook_count with ebook_count > 0 with 0 ?
Thank you in advance :)
I assume your variable is a factor. A factor does not let you assign a value to a cell unless that value is already one of its levels.
Using iris, let's see what may have happened and how it can be solved.
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can see Species variable is a factor.
We can put some NAs into it by:
iris[c(11:13),5] = NA
iris[c(11:15), ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
11 5.4 3.7 1.5 0.2 <NA>
12 4.8 3.4 1.6 0.2 <NA>
13 4.8 3.0 1.4 0.1 <NA>
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
Now, if I try to fill those NAs with "Unknown" using your code:
x = "Unknown"
iris$Species[which(is.na(iris$Species))] = x
which will generate:
Warning message:
In `[<-.factor`(`*tmp*`, which(is.na(iris$Species)), value = c(1L, :
invalid factor level, NA generated
What you can do is first add a new level to the variable; the assignment then works:
levels(iris$Species) = c(levels(iris$Species), "Unknown")
levels(iris$Species)
[1] "setosa" "versicolor" "virginica" "Unknown"
#You can see now Unknown is one of the levels
iris$Species[which(is.na(iris$Species))] = "Unknown"
table(iris$Species)
setosa versicolor virginica Unknown
47 50 50 3
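For the second part of the question (setting weight to 0 where it is NA and ebook_count > 0), a plain logical index should be enough. The sketch below uses a made-up stand-in for kd, since the real data is not shown; only the column names weight and ebook_count come from the question. which() is used so that rows where ebook_count is itself NA are skipped rather than causing an error.

```r
# Hypothetical stand-in for the asker's data frame `kd`
kd <- data.frame(weight      = c(NA, 250, NA, NA),
                 ebook_count = c(1,  0,   0,  NA))

# Rows with a missing weight and at least one ebook
fix_rows <- which(is.na(kd$weight) & kd$ebook_count > 0)
kd$weight[fix_rows] <- 0
```

Only the first row is changed; rows with no ebooks, or with an unknown ebook_count, keep their NA weight.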
How can I remove identical columns when combining two data frames?
Consider the dummy example below:
data(iris)
iris2 <- iris
iris2[ 2:7, c(1,3,5)] <- NA
Xa <- cbind(iris, iris2)
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa NA 3.0 NA 0.2 <NA>
##3 4.7 3.2 1.3 0.2 setosa NA 3.2 NA 0.2 <NA>
##4 4.6 3.1 1.5 0.2 setosa NA 3.1 NA 0.2 <NA>
##5 5.0 3.6 1.4 0.2 setosa NA 3.6 NA 0.2 <NA>
##6 5.4 3.9 1.7 0.4 setosa NA 3.9 NA 0.4 <NA>
It is very easy to drop columns with the same name:
Xa <- Xa[ , !(duplicated(names(Xa)))]
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa
##3 4.7 3.2 1.3 0.2 setosa
##4 4.6 3.1 1.5 0.2 setosa
##5 5.0 3.6 1.4 0.2 setosa
##6 5.4 3.9 1.7 0.4 setosa
But not all dropped columns have the same contents. How can I drop identical columns (same name and same contents) from a data frame?
The expected result is:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Petal.Length Species
## 1 5.1 3.5 1.4 0.2 setosa 5.1 1.4 setosa
## 2 4.9 3.0 1.4 0.2 setosa NA NA <NA>
## 3 4.7 3.2 1.3 0.2 setosa NA NA <NA>
## 4 4.6 3.1 1.5 0.2 setosa NA NA <NA>
## 5 5.0 3.6 1.4 0.2 setosa NA NA <NA>
## 6 5.4 3.9 1.7 0.4 setosa NA NA <NA>
You could do
Xa[!duplicated.default(Xa)]
# or
Xa[, !duplicated.default(Xa)]
# or, as mentioned by @akrun in a comment
Xa[!duplicated(c(Xa))]
Whichever way, the columns are renamed automatically (as data.frame() usually does) so that there are no longer any dupes among them.
We can't use plain duplicated here because it would dispatch to duplicated.data.frame, which compares rows to find duplicates, whereas duplicated.default compares the elements of a vector. A data.frame is a list, that is, a vector of (pointers to) column vectors, which is why duplicated.default works in this case. duplicated(c(Xa)) and duplicated(as.list(Xa)) also work because they turn Xa from a data.frame into a plain list.
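The difference between the two methods is easy to see on a tiny data frame. The following sketch uses made-up data (df with columns a to d) in which columns a and b have identical contents, as do c and d, and rows 1 and 2 are identical:

```r
# Hypothetical toy data: a == b and c == d column-wise; rows 1 and 2 match
df <- data.frame(a = c(1, 1, 2), b = c(1, 1, 2),
                 c = c(3, 3, 5), d = c(3, 3, 5))

row_dups <- duplicated(df)          # data.frame method: one flag per row
col_dups <- duplicated.default(df)  # default method: one flag per column

# Keep only the first copy of each distinct column
dedup <- df[!col_dups]
```

Here row_dups has length 3 (one per row) and col_dups has length 4 (one per column). Note that column names play no part in the comparison, only the contents do.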
Based on the accepted answer, I came up with a very simple function for this task:
rm.df.dupl <- function(x){
stopifnot(is.data.frame(x))
x <- x[ , !duplicated.default(x)]
return(x)
}
All you have to do now is:
rm.df.dupl(Xa)