I would like to add columns in a dataframe (filled in with NA) from an existing vector.
Here's my example (I want 2 new columns "State" and "Type" with NA)
Many thanks in advance !
names_variables <- c("Sepal.Length",
"Sepal.Width",
"Petal.Length",
"Petal.Width",
"Species",
"State",
"Type")
difference <- setdiff(names_variables,names(iris))
# Not working
if (length(difference>0)) {
resultat <- iris %>%
mutate(
for (var_to_add in (difference)) {
!!sym(var_to_add) := NA)
}
)
}
I'm not exactly sure what the result should be, but we can create a named vector of new variables and splice it into mutate with !!!:
library(dplyr)
names_variables <- c("Sepal.Length",
"Sepal.Width",
"Petal.Length",
"Petal.Width",
"Species",
"State",
"Type")
difference <- setdiff(names_variables,names(iris))
new_vars <- setNames(rep(NA, length(difference)), difference)
iris %>%
as_tibble() %>% # for printing
mutate(!!! new_vars)
#> # A tibble: 150 x 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species State Type
#> <dbl> <dbl> <dbl> <dbl> <fct> <lgl> <lgl>
#> 1 5.1 3.5 1.4 0.2 setosa NA NA
#> 2 4.9 3 1.4 0.2 setosa NA NA
#> 3 4.7 3.2 1.3 0.2 setosa NA NA
#> 4 4.6 3.1 1.5 0.2 setosa NA NA
#> 5 5 3.6 1.4 0.2 setosa NA NA
#> 6 5.4 3.9 1.7 0.4 setosa NA NA
#> 7 4.6 3.4 1.4 0.3 setosa NA NA
#> 8 5 3.4 1.5 0.2 setosa NA NA
#> 9 4.4 2.9 1.4 0.2 setosa NA NA
#> 10 4.9 3.1 1.5 0.1 setosa NA NA
#> # … with 140 more rows
Created on 2021-06-09 by the reprex package (v0.3.0)
Alternatively, with base R:
names_variables <- c("Sepal.Length",
"Sepal.Width",
"Petal.Length",
"Petal.Width",
"Species",
"State",
"Type")
difference <- setdiff(names_variables, names(iris))
iris[, difference] <- NA
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species State Type
#> 1 5.1 3.5 1.4 0.2 setosa NA NA
#> 2 4.9 3.0 1.4 0.2 setosa NA NA
#> 3 4.7 3.2 1.3 0.2 setosa NA NA
#> 4 4.6 3.1 1.5 0.2 setosa NA NA
#> 5 5.0 3.6 1.4 0.2 setosa NA NA
#> 6 5.4 3.9 1.7 0.4 setosa NA NA
Created on 2021-06-09 by the reprex package (v0.3.0)
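The base R approach can also be wrapped in a small helper. This is only a sketch: add_missing_columns is a hypothetical name, not a function from any package. It writes the emptiness check as length(x) > 0, whereas the question's length(difference>0) measures the length of a logical vector instead.

```r
# Hypothetical helper (not from any package): add every name in `wanted`
# that is not yet a column of `df`, filled with NA.
add_missing_columns <- function(df, wanted) {
  missing_cols <- setdiff(wanted, names(df))
  if (length(missing_cols) > 0) {
    df[, missing_cols] <- NA
  }
  df
}

result <- add_missing_columns(iris, c(names(iris), "State", "Type"))
names(result)
```

When nothing is missing, the data frame is returned unchanged, so the helper should be safe to call repeatedly.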
The question is simple, but I would like to know if there is an elegant way to achieve this goal. I would like to keep only the rows that have no NA in any variable, while ignoring one arbitrary variable.
Something like this:
data %>% filter(complete.cases(-var1))
Does anyone know how to do this with dplyr? I could list all the variables, but in a dataset with lots of variables that is impractical...
Thanks!
You can do
data %>% filter(complete.cases(select(., -var1)))
Which does the job, as this reprex demonstrates.
First, create a dummy data set where the whole first column is NA
library(dplyr)
data <- setNames(iris[1:10,], paste0('var', 1:5))
data$var1 <- NA
data
#> var1 var2 var3 var4 var5
#> 1 NA 3.5 1.4 0.2 setosa
#> 2 NA 3.0 1.4 0.2 setosa
#> 3 NA 3.2 1.3 0.2 setosa
#> 4 NA 3.1 1.5 0.2 setosa
#> 5 NA 3.6 1.4 0.2 setosa
#> 6 NA 3.9 1.7 0.4 setosa
#> 7 NA 3.4 1.4 0.3 setosa
#> 8 NA 3.4 1.5 0.2 setosa
#> 9 NA 2.9 1.4 0.2 setosa
#> 10 NA 3.1 1.5 0.1 setosa
Note that filtering this by complete.cases returns an empty data frame:
data %>% filter(complete.cases(.))
#> [1] var1 var2 var3 var4 var5
#> <0 rows> (or 0-length row.names)
But we can exclude var1 from complete.cases like this:
data %>% filter(complete.cases(select(., -var1)))
#> var1 var2 var3 var4 var5
#> 1 NA 3.5 1.4 0.2 setosa
#> 2 NA 3.0 1.4 0.2 setosa
#> 3 NA 3.2 1.3 0.2 setosa
#> 4 NA 3.1 1.5 0.2 setosa
#> 5 NA 3.6 1.4 0.2 setosa
#> 6 NA 3.9 1.7 0.4 setosa
#> 7 NA 3.4 1.4 0.3 setosa
#> 8 NA 3.4 1.5 0.2 setosa
#> 9 NA 2.9 1.4 0.2 setosa
#> 10 NA 3.1 1.5 0.1 setosa
Created on 2023-01-17 with reprex v2.0.2
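Two other spellings of the same filter may be worth knowing. The sketch below assumes dplyr 1.0.4 or later (for if_all) and tidyr; the extra NA planted in var2 is an addition for illustration, so that one row is actually dropped.

```r
library(dplyr)
library(tidyr)

# Dummy data: var1 is entirely NA, and one extra NA is planted in var2
data <- setNames(iris[1:10, ], paste0("var", 1:5))
data$var1 <- NA
data$var2[3] <- NA

# Keep rows where every column except var1 is non-missing
kept_if_all <- data %>% filter(if_all(-var1, ~ !is.na(.x)))

# Same idea with tidyr: drop rows that have an NA in the selected columns
kept_drop_na <- data %>% drop_na(-var1)

nrow(kept_if_all)
nrow(kept_drop_na)
```

Both versions ignore var1 and remove only the row whose var2 is missing.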
I would like to convert NA_character_ to "NA".
Data:
iris_test <- head(iris)
iris_test[c(1,4),c(2,3)] <- NA_real_
iris_test[c(1,2),5] <- NA_character_
iris_test$Species <- as.character(iris_test$Species)
iris_test$NAs <- NA_character_
iris_test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 <NA> <NA>
2 4.9 3.0 1.4 0.2 <NA> <NA>
3 4.7 3.2 1.3 0.2 setosa <NA>
4 4.6 NA NA 0.2 setosa <NA>
5 5.0 3.6 1.4 0.2 setosa <NA>
6 5.4 3.9 1.7 0.4 setosa <NA>
Expected_output:
expected <- iris_test
expected[c(1,2),5] <- "NA"
expected$NAs <- "NA"
expected
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 NA NA
2 4.9 3.0 1.4 0.2 NA NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 NA NA 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
I tried the following but it failed miserably:
iris_test[(sapply(iris_test, class)=="character")&is.na(iris_test)] <- "NA"
Converting to the string "NA" is not recommended. The issue in the code is that sapply(iris_test, class) returns a vector with one element per column, whereas is.na(iris_test) returns a matrix with one element per cell, so the two do not line up. One option is to subset the character columns first, then apply is.na to that subset and do the assignment:
i1 <- sapply(iris_test, is.character)
iris_test[i1][is.na(iris_test[i1])] <- "NA"
-output
> str(iris_test)
'data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num NA 3 3.2 NA 3.6 3.9
$ Petal.Length: num NA 1.4 1.3 NA 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : chr "NA" "NA" "setosa" "setosa" ...
$ NAs : chr "NA" "NA" "NA" "NA" ...
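To make the length mismatch concrete, here is a minimal sketch that rebuilds a similar data frame, compares the two shapes, and then constructs a cell-level mask by repeating each column flag once per row. The names is_chr, na_mat and mask are illustrative only.

```r
iris_test <- head(iris)
iris_test$Species <- as.character(iris_test$Species)
iris_test[c(1, 2), "Species"] <- NA_character_
iris_test[c(1, 4), c(2, 3)] <- NA_real_

is_chr <- sapply(iris_test, is.character)  # one logical per column
na_mat <- is.na(iris_test)                 # one logical per cell
length(is_chr)  # number of columns
dim(na_mat)     # rows x columns

# Repeat each column flag once per row so that it lines up with the
# column-major layout of na_mat
mask <- na_mat & rep(is_chr, each = nrow(iris_test))
iris_test[mask] <- "NA"
str(iris_test)
```

Because the mask is TRUE only in character columns, the numeric columns keep their real NA values.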
We could use replace_na together with as.character(). Note that this converts every column to character, so the numeric NAs are replaced as well:
library(dplyr)
library(tidyr)
iris_test %>%
mutate(across(everything(), ~replace_na(as.character(.), "NA")))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 NA NA
2 4.9 3 1.4 0.2 NA NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 NA NA 0.2 setosa NA
5 5 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
I ended up with the following solution:
iris_test %>% mutate_if(is.character, replace_na, "NA")
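With dplyr 1.0.0 or later, the same column-wise replacement can be written with across() and where(), which supersede the scoped mutate_if verb. A minimal sketch, rebuilding similar test data so that it runs on its own:

```r
library(dplyr)
library(tidyr)

iris_test <- head(iris)
iris_test$Species <- as.character(iris_test$Species)
iris_test[c(1, 2), "Species"] <- NA_character_
iris_test[c(1, 4), c(2, 3)] <- NA_real_

# Replace NA with the string "NA" in character columns only
result <- iris_test %>%
  mutate(across(where(is.character), ~ replace_na(.x, "NA")))
str(result)
```

Only the character columns are touched, so the numeric columns stay numeric and keep their NA values.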
Below is the code I used to do a mode imputation for the column status_group of the dataset tan1.
How do I rewrite the same using pipes? The unique() function does not seem to work in pipes.
NA_stat <- unique(tan1$status_group[!is.na(tan1$status_group)])
mode <- NA_stat[which.max(tabulate(match(tan1$status_group, NA_stat)))]
tan1$status_group[is.na(tan1$status_group)] <- mode
Also, how do I apply this same process for multiple columns?
Here are some examples of determining and imputing the mode in a pipe.
Functions to calculate mode:
library(tidyverse)
# Single mode (returns only the first mode if there's more than one)
# https://stackoverflow.com/a/8189441/496488
# Modified to remove NA
Mode <- function(x) {
ux <- na.omit(unique(x))
ux[which.max(tabulate(match(x, ux)))]
}
# Return all modes if there's more than one
# https://stackoverflow.com/a/8189441/496488
# Modified to remove NA
Modes <- function(x) {
ux <- na.omit(unique(x))
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
Apply the functions to a data frame:
iris %>%
summarise(across(everything(), Mode))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5 3 1.4 0.2 setosa
iris %>% map(Modes)
#> $Sepal.Length
#> [1] 5
#>
#> $Sepal.Width
#> [1] 3
#>
#> $Petal.Length
#> [1] 1.4 1.5
#>
#> $Petal.Width
#> [1] 0.2
#>
#> $Species
#> [1] setosa versicolor virginica
#> Levels: setosa versicolor virginica
Impute missing data using the mode. But note that we use Mode, which returns only the first mode in cases where there are multiple modes. You may need to adjust your method if you have multiple modes.
# Create missing data
d = iris
d[1, ] = rep(NA, ncol(iris))
head(d)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 NA NA NA NA <NA>
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Replace missing values with the mode
d = d %>%
mutate(across(everything(), ~coalesce(., Mode(.))))
head(d)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.0 3.0 1.5 0.2 versicolor
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
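If only some columns should be imputed, the column selection inside across() can be narrowed. The sketch below reuses the Mode function from this answer; cols_to_impute is a hypothetical vector of column names chosen for illustration.

```r
library(dplyr)

# Single mode, ignoring NA (same definition as in the answer)
Mode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Create missing data in the first row
d <- iris
d[1, ] <- NA

# Hypothetical choice of columns to impute
cols_to_impute <- c("Sepal.Width", "Species")

d_imputed <- d %>%
  mutate(across(all_of(cols_to_impute), ~ coalesce(.x, Mode(.x))))
head(d_imputed)
```

Columns outside cols_to_impute keep their missing values.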
Using the Iris dataframe I can pretty easily pull the first n = 100 records with:
m_data<-iris
m_data[1:100,]
But I am also interested in pulling the first 100 records based on a nice split of the Species. Assume for the moment that the first 100 records are all the same species - I would like to pull the data with a "first sampling" based on the varying Species instead.
Any suggestions are welcome. Thank you.
You can also do this with dplyr, here selecting the first 10 from each species:
library(dplyr)
iris %>%
group_by(Species) %>%
filter(row_number() <= 10) # or slice(1:10)
#> # A tibble: 30 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 20 more rows
Created on 2018-08-13 by the reprex package (v0.2.0).
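In dplyr 1.0.0 and later there is also slice_head(), which states the intent directly. A minimal sketch of the same selection:

```r
library(dplyr)

# First 10 rows of each species, then drop the grouping
first_ten <- iris %>%
  group_by(Species) %>%
  slice_head(n = 10) %>%
  ungroup()

count(first_ten, Species)
```

Changing n adjusts how many rows are taken from each group.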
Here's an alternative:
do.call(rbind, lapply(split(iris, iris$Species), head, 100))
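This pulls the first 100 records of each Species from iris. Since iris has only 50 rows per species, every row is returned here; use a smaller number to see the effect.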
You can use by instead of lapply
do.call(rbind, by(iris, iris$Species, head, 100))
Example
Suppose in the famous iris data set, I have determined that when Sepal.Length > 5.0, there was an error in my measurement device.
In this contrived example, I would like to keep the Sepal.Length column with its original value, but change the remaining columns to NA if the Sepal.Length > 5.0 for that row.
As an example, this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Would become this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA NA
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA NA
I could certainly do this manually via vectorization. Something along the lines of:
iris$Sepal.Width <- ifelse(iris$Sepal.Length > 5.0, NA, iris$Sepal.Width)
In this approach however, I would need to manually specify every column.
Question
I strongly suspect there is a clever way to tackle this via either purrr or dplyr. Nevertheless, I've gotten myself down a pmap / modify_at rabbit hole. Any suggestions towards elegance would be much appreciated.
Thanks!
library(data.table)
dt <- copy(iris)
setDT(dt)
dt[Sepal.Length > 5.0, (which(!names(dt) == "Sepal.Length")) := NA]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 NA NA NA NA
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 NA NA NA NA
# 147: 6.3 NA NA NA NA
# 148: 6.5 NA NA NA NA
# 149: 6.2 NA NA NA NA
# 150: 5.9 NA NA NA NA
An alternative is simply the following (only handy if the columns to blank out are all the columns from the second one onwards):
iris[iris$Sepal.Length > 5.0, 2:ncol(iris)] <- NA
# And the output for first six rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA <NA>
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA <NA>
It sounds like this would work for you
my_clip <- function(x, z) ifelse(z>5, NA, x)
iris %>% mutate_at(vars(-Sepal.Length), my_clip, z=.$Sepal.Length)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA NA
# 2 4.9 3.0 1.4 0.2 1
# 3 4.7 3.2 1.3 0.2 1
# 4 4.6 3.1 1.5 0.2 1
# 5 5.0 3.6 1.4 0.2 1
# 6 5.4 NA NA NA NA
We use mutate_at to grab all the columns we want to transform. Since you can't easily reference other columns inside the mutate_at function, we pass the threshold column in as a separate parameter using the .$ syntax. Note that ifelse() drops the factor attributes, which is why Species appears as integer codes in the output.
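With dplyr 1.0.0 or later, a similar result can be sketched with across(), where other columns can be referenced directly inside the lambda. Using replace() instead of ifelse() is a substitution made here so that Species stays a factor; it is not part of the answer's approach.

```r
library(dplyr)

# Blank out every column except Sepal.Length where Sepal.Length > 5
result <- iris %>%
  mutate(across(-Sepal.Length, ~ replace(.x, Sepal.Length > 5, NA)))
head(result)
```

Rows with Sepal.Length of 5 or less are left untouched.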
Since you asked for a purrr example, here goes. Although I prefer the data.table answer already proposed:
library(dplyr) # needed for mutate()
library(purrr)
library(tidyr)
iris %>% nest(-Sepal.Length) %>%
mutate(data = ifelse(Sepal.Length > 5.0,
map(data, function(x) x*NA), data)) %>%
unnest
With magrittr you could do this:
library(magrittr)
iris %>% head %>% inset(.$Sepal.Length > 5,-1,NA)
Or, using the base R replacement function instead of magrittr's inset (same output, just an uglier function name, and you still need magrittr or dplyr for the pipe):
iris %>% head %>% `[<-`(.$Sepal.Length > 5,-1,NA)
-1 is the index of the column you want to keep, negated.
result
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA <NA>
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 NA NA NA <NA>