Selectively Remove Column Values in R Data Frame - r

Example
Suppose in the famous iris data set, I have determined that when Sepal.Length > 5.0, there was an error in my measurement device.
In this contrived example, I would like to keep the Sepal.Length column with its original value, but change the remaining columns to NA if the Sepal.Length > 5.0 for that row.
As an example, this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Would become this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA NA
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA NA
I could certainly do this manually via vectorization. Something along the lines of:
iris$Sepal.Width <- ifelse(iris$Sepal.Length > 5.0, NA, iris$Sepal.Width)
In this approach however, I would need to manually specify every column.
Question
I strongly suspect there is a clever way to tackle this via either purrr or dplyr. Nevertheless, I've gotten myself down a pmap / modify_at rabbit hole. Any suggestions towards elegance would be much appreciated.
Thanks!

library(data.table)
dt <- copy(iris)
setDT(dt)
dt[Sepal.Length > 5.0, (which(!names(dt) == "Sepal.Length")) := NA]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 NA NA NA NA
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 NA NA NA NA
# 147: 6.3 NA NA NA NA
# 148: 6.5 NA NA NA NA
# 149: 6.2 NA NA NA NA
# 150: 5.9 NA NA NA NA

An alternative is simply the following (handy only when the columns to blank out are all those from the second onward):
iris[iris$Sepal.Length > 5.0, 2:ncol(iris)] <- NA
# And the output for first six rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA <NA>
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA <NA>
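If the column to keep is not the first one, the same base R assignment can pick the columns by name instead of by position. A minimal sketch, assuming a working copy called df (a name introduced here) so the built-in iris is left untouched:

```r
# Work on a copy so the built-in data set is not overwritten.
df <- head(iris)

# Columns to blank out: everything except the one we trust.
others <- setdiff(names(df), "Sepal.Length")

# Rows where the measurement device is assumed to have failed.
bad <- df$Sepal.Length > 5.0

# Matrix-style assignment sets every selected cell to NA;
# the factor column Species becomes <NA> and stays a factor.
df[bad, others] <- NA
df
```

Selecting by name means the code keeps working if the columns are later reordered.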

It sounds like this would work for you
library(dplyr)
my_clip <- function(x, z) ifelse(z > 5, NA, x)
iris %>% mutate_at(vars(-Sepal.Length), my_clip, z=.$Sepal.Length)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA NA
# 2 4.9 3.0 1.4 0.2 1
# 3 4.7 3.2 1.3 0.2 1
# 4 4.6 3.1 1.5 0.2 1
# 5 5.0 3.6 1.4 0.2 1
# 6 5.4 NA NA NA NA
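We use mutate_at to select all the columns we want to transform. Since you can't easily reference other columns inside the mutate_at function, we pass the threshold column in as a separate parameter using the .$ syntax. Note that ifelse() drops factor attributes, which is why Species shows its integer codes (1) rather than its labels in the output above.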
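On dplyr 1.0 or later (an assumption about the installed version), across() lets the lambda refer to other columns directly, and replace() leaves Species as a factor. The following is a sketch of that variant, not the answer's original code:

```r
library(dplyr)

result <- head(iris) %>%
  # across() runs the lambda on every column except Sepal.Length;
  # Sepal.Length inside the lambda refers to the unmodified column.
  mutate(across(-Sepal.Length, ~ replace(.x, Sepal.Length > 5, NA)))

result
```

Because replace() assigns NA into the existing vector rather than building a new one, each column keeps its original type.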

Since you asked for a purrr example, here goes, although I prefer the data.table answer already proposed:
library(dplyr)  # for mutate() and %>%
library(purrr)
library(tidyr)
iris %>% nest(-Sepal.Length) %>%
mutate(data = ifelse(Sepal.Length > 5.0,
map(data, function(x) x*NA), data)) %>%
unnest

With magrittr you could do this:
library(magrittr)
iris %>% head %>% inset(.$Sepal.Length > 5,-1,NA)
Or use the base R replacement function instead of magrittr's inset (same output, just an uglier function name, and you still need magrittr or dplyr for the pipes):
iris %>% head %>% `[<-`(.$Sepal.Length > 5,-1,NA)
-1 is the index of the column you want to keep, negated.
result
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA <NA>
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 NA NA NA <NA>

Related

Convert na_character_ to "NA"

I would like to convert NA_character_ to "NA".
Data:
iris_test <- head(iris)
iris_test[c(1,4),c(2,3)] <- NA_real_
iris_test[c(1,2),5] <- NA_character_
iris_test$Species <- as.character(iris_test$Species)
iris_test$NAs <- NA_character_
iris_test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 <NA> <NA>
2 4.9 3.0 1.4 0.2 <NA> <NA>
3 4.7 3.2 1.3 0.2 setosa <NA>
4 4.6 NA NA 0.2 setosa <NA>
5 5.0 3.6 1.4 0.2 setosa <NA>
6 5.4 3.9 1.7 0.4 setosa <NA>
Expected_output:
expected <- iris_test
expected[c(1,2),5] <- "NA"
expected$NAs <- "NA"
expected
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 NA NA
2 4.9 3.0 1.4 0.2 NA NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 NA NA 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
I tried the following but it failed miserably:
iris_test[(sapply(iris_test, class)=="character")&is.na(iris_test)] <- "NA"
Converting to the string "NA" is not recommended. The issue in the code is that sapply(iris_test, class) returns a vector with one element per column, whereas is.na(iris_test) returns a matrix with one element per cell, so the two do not line up. An option is to subset the character columns first, then apply is.na to that subset and assign:
i1 <- sapply(iris_test, is.character)
iris_test[i1][is.na(iris_test[i1])] <- "NA"
-output
> str(iris_test)
'data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num NA 3 3.2 NA 3.6 3.9
$ Petal.Length: num NA 1.4 1.3 NA 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : chr "NA" "NA" "setosa" "setosa" ...
$ NAs : chr "NA" "NA" "NA" "NA" ...
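To make the mismatch concrete: sapply() yields one value per column, while is.na() on a data frame yields one value per cell, so combining them with & recycles the per-column vector down the rows of the matrix instead of across its columns. A small base R sketch with made-up data (df and is_chr are names introduced here):

```r
# Small data frame with one numeric and one character column (illustrative data).
df <- data.frame(x = c(1, NA, 3),
                 s = c("a", NA, "c"),
                 stringsAsFactors = FALSE)

# One logical per column ...
is_chr <- sapply(df, is.character)
length(is_chr)   # 2

# ... versus one logical per cell.
dim(is.na(df))   # 3 2

# Restrict to the character columns first, then replace NA column by column.
df[is_chr] <- lapply(df[is_chr], function(col) {
  col[is.na(col)] <- "NA"
  col
})
df
```

The numeric column keeps its real NA; only the character column receives the string "NA".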
We could also use replace_na together with as.character(). Note that this converts every column, including the numeric ones, to character:
library(dplyr)
library(tidyr)
iris_test %>%
mutate(across(everything(), ~replace_na(as.character(.), "NA")))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species NAs
1 5.1 NA NA 0.2 NA NA
2 4.9 3 1.4 0.2 NA NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 NA NA 0.2 setosa NA
5 5 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
I ended up with the following solution:
iris_test %>% mutate_if(is.character, replace_na, "NA")

How to slice a dataset into multiple datasets in R

For this example, I'm going to use iris dataset built-in in R.
How can I avoid the copy and pasting of the syntax below to have the same output?
package
library(dplyr)
Input
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
Manual Solution
I have to subset my dataset based on the column names.
I know how to do this "manually" but it would require a lot of copying and pasting on my current dataset.
Sepal <- iris %>% select(contains("Sepal"))
Petal <- iris %>% select(contains("Petal"))
Output
head(Sepal)
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
# 6 5.4 3.9
head(Petal)
# Petal.Length Petal.Width
# 1 1.4 0.2
# 2 1.4 0.2
# 3 1.3 0.2
# 4 1.5 0.2
# 5 1.4 0.2
# 6 1.7 0.4
How can I automate this process? I think I could use the purrr package here, but I couldn't find a way to do it.
You can use
library(tidyverse)
map(set_names(c("Sepal", "Petal")), ~ select(iris, starts_with(.x)))
output (head)
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
An option is also to use split.default on the substring of column names to return a named list of data.frames
library(dplyr)
library(stringr)
head(iris) %>%
select(-Species) %>%
split.default(str_remove(names(.), "\\..*"))
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
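The same split can be done without dplyr or stringr. A base R sketch, assuming (as in the answer above) that the group name is everything before the first dot in the column name; num, prefix and parts are names introduced here:

```r
df <- head(iris)

# Keep only the numeric columns (this drops Species).
num <- df[sapply(df, is.numeric)]

# Group label = column name with the first dot and everything after it removed.
prefix <- sub("\\..*$", "", names(num))

# split.default() treats the data frame as a list of columns and
# returns one data frame per label, in alphabetical order of the labels.
parts <- split.default(num, prefix)
names(parts)
```

parts$Sepal and parts$Petal can then be used like the manually created Sepal and Petal data frames.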

Re-order rows of a R dataframe based on a column/ label

My current dataframe in R has the following dimensions
nrows=605
ncol: 1514
The first column indicates the class/ label and my dataset has only two classes namely: setosa and iris.
test[1:5,]
class id1 id2...
1: setosa 2 4.....
2: setosa 2 5 .....
3: setosa 5 4 .....
4: iris 5 9......
5: iris 7 9 ....
However, the data frame is currently ordered by class: rows 2 to 233 correspond to class setosa, and class iris runs from row 234 to the end. I want the dataset rearranged so that the samples are mixed up.
The expected output should be in the following form:
If I look at df[1:10,], i.e. the first 10 rows of the data frame, I should see samples of both iris and setosa. Any ideas or suggestions on how to do this?
library( tidyverse )
iris[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5.0 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
# 10 4.9 3.1 1.5 0.1 setosa
df <- iris %>%
group_by( Species ) %>%
mutate( id = row_number() ) %>%
arrange( id ) %>%
select ( -id )
df[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 7 3.2 4.7 1.4 versicolor
# 3 6.3 3.3 6 2.5 virginica
# 4 4.9 3 1.4 0.2 setosa
# 5 6.4 3.2 4.5 1.5 versicolor
# 6 5.8 2.7 5.1 1.9 virginica
# 7 4.7 3.2 1.3 0.2 setosa
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 7.1 3 5.9 2.1 virginica
# 10 4.6 3.1 1.5 0.2 setosa
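The code above interleaves the classes in a fixed round-robin order. If a random mix is acceptable instead, a different and simpler technique is to permute the row indices with sample(). A base R sketch; the seed value and the name shuffled are arbitrary choices made here:

```r
set.seed(42)  # arbitrary seed, only so the shuffle is reproducible

# sample(n) returns a random permutation of 1..n, used here as row indices.
shuffled <- iris[sample(nrow(iris)), ]

# Optional: renumber the rows 1..n after shuffling.
rownames(shuffled) <- NULL

# Class counts among the first ten rows.
table(head(shuffled, 10)$Species)
```

A random permutation does not guarantee that every class appears in the first ten rows, although with balanced classes it is very likely.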

Automatically generate new variable names using dplyr mutate

I would like to create variable names dynamically while using dplyr; although, I’d be fine with a non-dplyr solution as well.
For Example:
data(iris)
library(dplyr)
iris <- iris %>%
group_by(Species) %>%
mutate(
lag_Sepal.Length = lag(Sepal.Length),
lag_Sepal.Width = lag(Sepal.Width),
lag_Petal.Length = lag(Petal.Length)
) %>%
ungroup
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species lag_Sepal.Length lag_Sepal.Width
(dbl) (dbl) (dbl) (dbl) (fctr) (dbl) (dbl)
1 5.1 3.5 1.4 0.2 setosa NA NA
2 4.9 3.0 1.4 0.2 setosa 5.1 3.5
3 4.7 3.2 1.3 0.2 setosa 4.9 3.0
4 4.6 3.1 1.5 0.2 setosa 4.7 3.2
5 5.0 3.6 1.4 0.2 setosa 4.6 3.1
6 5.4 3.9 1.7 0.4 setosa 5.0 3.6
Variables not shown: lag_Petal.Length (dbl)
But instead of doing this three times, I want to create 100 of these “lag” variables, each named lag_<original variable name>. I’m trying to figure out how to do this without typing the new variable name 100 times, but I’m coming up short.
I’ve looked into this example and this example elsewhere on SO. They are similar, but I’m not quite able to piece together the specific solution I need. Any help is appreciated!
Edit
Thanks to @BenFasoli for the inspiration. I took his answer and tweaked it just a bit to get the solution I needed.
I also used This RStudio Blog and This SO post. The "lag" in the variable name is trailing instead of leading, but I can live with that.
My final code is posted here in case it’s helpful to anyone else:
lagged <- iris %>%
group_by(Species) %>%
mutate_at(
vars(Sepal.Length:Petal.Length),
list(lag = lag)) %>%  # funs() is deprecated since dplyr 0.8; a named list gives the same _lag suffix
ungroup
# A tibble: 6 x 8
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_lag Sepal.Width_lag
<dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa NA NA
2 4.9 3.0 1.4 0.2 setosa 5.1 3.5
3 4.7 3.2 1.3 0.2 setosa 4.9 3.0
4 4.6 3.1 1.5 0.2 setosa 4.7 3.2
5 5.0 3.6 1.4 0.2 setosa 4.6 3.1
6 5.4 3.9 1.7 0.4 setosa 5.0 3.6
# ... with 1 more variables: Petal.Length_lag <dbl>
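For a leading lag_ prefix, across() with a .names template is one option on dplyr 1.0 or later (an assumption about the installed version). This is a sketch, not the poster's code; datasets::iris is used explicitly because the question overwrote iris earlier:

```r
library(dplyr)

lagged <- datasets::iris %>%
  group_by(Species) %>%
  # {.col} in .names is replaced by each selected column's name,
  # so the new columns are lag_Sepal.Length, lag_Sepal.Width, lag_Petal.Length.
  mutate(across(Sepal.Length:Petal.Length, dplyr::lag, .names = "lag_{.col}")) %>%
  ungroup()

names(lagged)
```

Widening the selection inside across() scales this to any number of columns without typing the new names.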
You can use mutate_all (or mutate_at for specific columns) then prepend lag_ to the column names.
data(iris)
library(dplyr)
lag_iris <- iris %>%
group_by(Species) %>%
mutate_all(lag) %>%  # funs() is deprecated since dplyr 0.8
ungroup
colnames(lag_iris) <- paste0('lag_', colnames(lag_iris))
head(lag_iris)
lag_Sepal.Length lag_Sepal.Width lag_Petal.Length lag_Petal.Width lag_Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 NA NA NA NA setosa
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5.0 3.6 1.4 0.2 setosa
Here is a data.table approach. I chose columns with numbers in this case. What you want to do is to choose column names and create new column names in advance. Then, you apply shift(), which works like lag() and lead() in the dplyr package, to each of the columns you chose.
library(data.table)
# Create a df for this demo.
mydf <- iris
# Choose columns that you want to apply lag() and create new colnames.
cols = names(iris)[sapply(iris, is.numeric)]
anscols = paste("lag_", cols, sep = "")
# Apply shift() to each of the chosen columns.
setDT(mydf)[, (anscols) := shift(.SD, 1, type = "lag"),
.SDcols = cols]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species lag_Sepal.Length lag_Sepal.Width
1: 5.1 3.5 1.4 0.2 setosa NA NA
2: 4.9 3.0 1.4 0.2 setosa 5.1 3.5
3: 4.7 3.2 1.3 0.2 setosa 4.9 3.0
4: 4.6 3.1 1.5 0.2 setosa 4.7 3.2
5: 5.0 3.6 1.4 0.2 setosa 4.6 3.1
---
146: 6.7 3.0 5.2 2.3 virginica 6.7 3.3
147: 6.3 2.5 5.0 1.9 virginica 6.7 3.0
148: 6.5 3.0 5.2 2.0 virginica 6.3 2.5
149: 6.2 3.4 5.4 2.3 virginica 6.5 3.0
150: 5.9 3.0 5.1 1.8 virginica 6.2 3.4
lag_Petal.Length lag_Petal.Width
1: NA NA
2: 1.4 0.2
3: 1.4 0.2
4: 1.3 0.2
5: 1.5 0.2
---
146: 5.7 2.5
147: 5.2 2.3
148: 5.0 1.9
149: 5.2 2.0
150: 5.4 2.3
Since you're also happy with a non-dplyr solution, try this:
lagger <- function(x, n) c(rep(NA,n), head(x,-n) )
iris[paste0("lag_", names(iris) )] <- lapply(iris, lagger, n=1)
head(iris,2)[-(1:5)]
# lag_Sepal.Length lag_Sepal.Width lag_Petal.Length lag_Petal.Width lag_Species
#1 NA NA NA NA NA
#2 5.1 3.5 1.4 0.2 1
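The lagger above shifts each column across the whole data frame and ignores Species. If per-group lags like those in the question are wanted, ave() is one base R way to apply the shift within each group. A sketch restricted to the numeric columns; lag1, num_cols and df are names introduced here:

```r
df <- datasets::iris
num_cols <- names(df)[sapply(df, is.numeric)]

# Shift a vector down by one position, padding the front with NA.
lag1 <- function(v) c(NA, head(v, -1))

# ave() applies lag1() within each Species group and returns the
# results in the original row order.
df[paste0("lag_", num_cols)] <- lapply(df[num_cols], function(col) {
  ave(col, df$Species, FUN = lag1)
})

# Row 51 is the first versicolor row, so its lag is NA rather than a setosa value.
df[c(1, 2, 51), c("Species", "Sepal.Length", "lag_Sepal.Length")]
```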

How to drop identical columns when combining data frames?

How can I remove identical columns when combining two data frames?
Consider the dummy example below:
data(iris)
iris2 <- iris
iris2[ 2:7, c(1,3,5)] <- NA
Xa <- cbind(iris, iris2)
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa NA 3.0 NA 0.2 <NA>
##3 4.7 3.2 1.3 0.2 setosa NA 3.2 NA 0.2 <NA>
##4 4.6 3.1 1.5 0.2 setosa NA 3.1 NA 0.2 <NA>
##5 5.0 3.6 1.4 0.2 setosa NA 3.6 NA 0.2 <NA>
##6 5.4 3.9 1.7 0.4 setosa NA 3.9 NA 0.4 <NA>
It is very easy to drop columns with the same name:
Xa <- Xa[ , !(duplicated(names(Xa)))]
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa
##3 4.7 3.2 1.3 0.2 setosa
##4 4.6 3.1 1.5 0.2 setosa
##5 5.0 3.6 1.4 0.2 setosa
##6 5.4 3.9 1.7 0.4 setosa
But not all dropped columns have the same contents. How can I drop identical columns (same name and same contents) from a data frame?
The expected result is:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Petal.Length Species
## 1 5.1 3.5 1.4 0.2 setosa 5.1 1.4 setosa
## 2 4.9 3.0 1.4 0.2 setosa NA NA <NA>
## 3 4.7 3.2 1.3 0.2 setosa NA NA <NA>
## 4 4.6 3.1 1.5 0.2 setosa NA NA <NA>
## 5 5.0 3.6 1.4 0.2 setosa NA NA <NA>
## 6 5.4 3.9 1.7 0.4 setosa NA NA <NA>
You could do
Xa[!duplicated.default(Xa)]
# or
Xa[, !duplicated.default(Xa)]
# or, as mentioned by @akrun in a comment
Xa[!duplicated(c(Xa))]
Either way, the columns are renamed automatically (as data.frame() usually does) so that there are no longer any duplicate names among them.
We can't use plain duplicated here because it would dispatch to duplicated.data.frame, which compares rows to find duplicates, whereas duplicated.default compares the elements of a vector. A data.frame is a list, that is, a vector of (pointers to) column vectors, which is why duplicated.default works in this case. duplicated(c(Xa)) and duplicated(as.list(Xa)) also work because they turn Xa from a data.frame into a plain list.
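The difference between the two methods is easy to see on a tiny example. A sketch with made-up data (df and dedup are names introduced here), where column b repeats column a and row 2 repeats row 1:

```r
df <- data.frame(a = c(1, 1, 2, 3),
                 b = c(1, 1, 2, 3),   # same contents as a
                 c = c(5, 5, 6, 7))

# Method for data frames: one result per ROW (row 2 repeats row 1).
duplicated(df)           # FALSE  TRUE FALSE FALSE

# Default method: treats df as a list, one result per COLUMN.
duplicated.default(df)   # FALSE  TRUE FALSE

# Keep only the first copy of each distinct column.
dedup <- df[!duplicated.default(df)]
names(dedup)             # "a" "c"
```

Note that this comparison looks only at column contents, not at column names, so two differently named columns with identical values also count as duplicates.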
Based on the accepted answer, I came up with a very simple function for this task:
rm.df.dupl <- function(x){
stopifnot(is.data.frame(x))
x <- x[ , !duplicated.default(x), drop = FALSE]  # drop = FALSE keeps a data frame even if only one column remains
return(x)
}
All you have to do now is:
rm.df.dupl(Xa)

Resources