How can I remove identical columns when combining two data frames?
Consider the dummy example below:
data(iris)
iris2 <- iris
iris2[ 2:7, c(1,3,5)] <- NA
Xa <- cbind(iris, iris2)
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa NA 3.0 NA 0.2 <NA>
##3 4.7 3.2 1.3 0.2 setosa NA 3.2 NA 0.2 <NA>
##4 4.6 3.1 1.5 0.2 setosa NA 3.1 NA 0.2 <NA>
##5 5.0 3.6 1.4 0.2 setosa NA 3.6 NA 0.2 <NA>
##6 5.4 3.9 1.7 0.4 setosa NA 3.9 NA 0.4 <NA>
It is very easy to drop columns with the same name:
Xa <- Xa[ , !(duplicated(names(Xa)))]
head(Xa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##1 5.1 3.5 1.4 0.2 setosa
##2 4.9 3.0 1.4 0.2 setosa
##3 4.7 3.2 1.3 0.2 setosa
##4 4.6 3.1 1.5 0.2 setosa
##5 5.0 3.6 1.4 0.2 setosa
##6 5.4 3.9 1.7 0.4 setosa
But not all dropped columns have the same contents. How can I drop identical columns (same name and same contents) from a data frame?
The expected result is:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length Petal.Length Species
## 1 5.1 3.5 1.4 0.2 setosa 5.1 1.4 setosa
## 2 4.9 3.0 1.4 0.2 setosa NA NA <NA>
## 3 4.7 3.2 1.3 0.2 setosa NA NA <NA>
## 4 4.6 3.1 1.5 0.2 setosa NA NA <NA>
## 5 5.0 3.6 1.4 0.2 setosa NA NA <NA>
## 6 5.4 3.9 1.7 0.4 setosa NA NA <NA>
You could do
Xa[!duplicated.default(Xa)]
# or
Xa[, !duplicated.default(Xa)]
# or, as mentioned by @akrun in a comment
Xa[!duplicated(c(Xa))]
Whichever way, the columns are renamed automatically (as data.frame() usually does) so that there are no longer any dupes among them.
We can't use plain duplicated() here because it would dispatch to duplicated.data.frame, which compares rows to find duplicates, while duplicated.default compares the elements of a vector. A data.frame is a list (a vector of column vectors), which is why duplicated.default works in this case. duplicated(c(Xa)) and duplicated(as.list(Xa)) also work because they turn Xa from a data.frame into a plain list.
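To see the difference between the two methods, here is a small check on the Xa object from the question (rebuilt so the snippet runs on its own). It is only an illustrative sketch; the names by_rows and by_cols are made up for the example.

```r
# Rebuild the example data from the question
iris2 <- iris
iris2[2:7, c(1, 3, 5)] <- NA
Xa <- cbind(iris, iris2)

# duplicated() dispatches to duplicated.data.frame: one value per ROW
by_rows <- duplicated(Xa)
length(by_rows)   # same as nrow(Xa)

# duplicated.default() treats the data frame as a list: one value per COLUMN
by_cols <- duplicated.default(Xa)
length(by_cols)   # same as ncol(Xa)

# Columns flagged as exact copies of an earlier column
names(Xa)[by_cols]
```

Only the columns whose contents are fully identical to an earlier column are flagged; the copies that received NA values are kept.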
Based on the accepted answer, I came up with a very simple function for this task:
rm.df.dupl <- function(x){
  stopifnot(is.data.frame(x))
  # drop = FALSE keeps a data frame even if only one column remains
  x <- x[ , !duplicated.default(x), drop = FALSE]
  return(x)
}
All you have to do now is:
rm.df.dupl(Xa)
For this example, I'm going to use the iris dataset built into R.
How can I avoid the copy and pasting of the syntax below to have the same output?
package
library(dplyr)
Input
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
Manual Solution
I have to subset my dataset based on the column names.
I know how to do this "manually" but it would require a lot of copying and pasting on my current dataset.
Sepal <- iris %>% select(contains("Sepal"))
Petal <- iris %>% select(contains("Petal"))
Output
head(Sepal)
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
# 6 5.4 3.9
head(Petal)
# Petal.Length Petal.Width
# 1 1.4 0.2
# 2 1.4 0.2
# 3 1.3 0.2
# 4 1.5 0.2
# 5 1.4 0.2
# 6 1.7 0.4
How can I automate this process? I think I can use the purrr package here, but I couldn't find a way to do it.
You can use
library(tidyverse)
map(set_names(c("Sepal", "Petal")), ~ select(iris, starts_with(.x)))
output (head)
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
Another option is to use split.default on a substring of the column names, which returns a named list of data.frames:
library(dplyr)
library(stringr)
head(iris) %>%
  select(-Species) %>%
  split.default(str_remove(names(.), "\\..*"))
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
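The same grouping idea can be sketched without dplyr or stringr. This is a minimal base R version, assuming the columns to group are everything except Species and that the prefix is the part of the name before the first dot; the names measurements, prefix and parts are made up for illustration.

```r
# Drop the non-measurement column, then group the remaining columns
# by the part of their name before the first dot
measurements <- head(iris)[ , names(iris) != "Species"]
prefix <- sub("\\..*", "", names(measurements))
parts <- split.default(measurements, prefix)

names(parts)        # one list element per prefix
names(parts$Sepal)  # the columns that share the "Sepal" prefix
```

Because split.default works on the data frame as a list of columns, each element of the result is itself a data frame.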
Sample df:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
In the testlag column, I'm interested in using dplyr::lag() to retrieve the previous value and add some column, for example Petal.Length, to it. As I have only one initial value, each subsequent calculation depends on the previous result, so it has to work iteratively. I thought something like mutate would work.
I first tried doing something like this:
iris %>% mutate_at("testlag", ~ lag(.) + Petal.Length)
But this removed the first value, and only gave a valid value for the second row and NAs for the rest. Intuitively I know why it's removing the first value, but I thought the nature of mutate would allow it to work for the rest of the values, so I don't know how to fix that.
Of course, with a plain for loop I could do something like:
for (idx in 2:nrow(iris)) {
  iris[[idx, "testlag"]] <-
    lag(iris$testlag)[idx] + iris[[idx, "Petal.Length"]]
}
But I would prefer to implement this in tidyverse syntax.
Edit: Desired output (from my for loop)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5.0
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.7
4 4.6 3.1 1.5 0.2 setosa 9.2
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.3
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.2
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.1
Does this work for you?
library(tidyverse)
library("data.table")
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% mutate (testlag = lag(first(testlag) + cumsum(Petal.Length)))
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.8
4 4.6 3.1 1.5 0.2 setosa 9.1
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.0
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.1
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.0
Since technically there is no N-1 Petal length when N = 1, I left the first value of testlag as NA. Do you really need it to be the initial value? If so, this will work:
iris %>% mutate (testlag = lag(first(testlag) + cumsum(Petal.Length), default=first(testlag)))
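Note that the lagged cumulative sum above adds the previous row's Petal.Length, so it differs from the desired output in the question (7.8 versus 7.7 in row 3, for example). If the goal is to match the for loop, where each row adds its own Petal.Length to the previous testlag, a sketch along these lines may be closer. The names df, res_cumsum and res_accum are made up here, and the accumulate() variant is just one way to write the recurrence explicitly.

```r
library(dplyr)
library(purrr)

df <- head(iris, 10)
df$testlag <- NA_real_
df$testlag[1] <- 5

# Closed form: seed value plus the running total of Petal.Length,
# excluding the first row's Petal.Length
res_cumsum <- df %>%
  mutate(testlag = first(testlag) + cumsum(Petal.Length) - first(Petal.Length))

# Same recurrence written step by step: start from the seed and
# add each later row's Petal.Length to the previous result
res_accum <- df %>%
  mutate(testlag = accumulate(Petal.Length[-1], ~ .x + .y, .init = first(testlag)))

res_cumsum$testlag
```

The cumsum() form only works because the update is a plain sum; accumulate() is the more general tool when each step is some other function of the previous value.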
The function you're looking for is tidyr::fill
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% fill(testlag, .direction = "down")
# Note the default is 'down', but I included here for completeness
This takes the specified column (testlag in this case), and copies any values in that column to the values below. This also works if you have a value in a subset of the rows: it copies the value down until it reaches a new value, then it picks up with that one.
For example:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris[[5,"testlag"]] <- 10
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
Applying this function...
iris %>% fill(testlag, .direction = "down")
Gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa 5
3 4.7 3.2 1.3 0.2 setosa 5
4 4.6 3.1 1.5 0.2 setosa 5
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa 10
7 4.6 3.4 1.4 0.3 setosa 10
8 5.0 3.4 1.5 0.2 setosa 10
9 4.4 2.9 1.4 0.2 setosa 10
10 4.9 3.1 1.5 0.1 setosa 10
Example
Suppose in the famous iris data set, I have determined that when Sepal.Length > 5.0, there was an error in my measurement device.
In this contrived example, I would like to keep the Sepal.Length column with its original value, but change the remaining columns to NA if the Sepal.Length > 5.0 for that row.
As an example, this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Would become this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA NA
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA NA
I could certainly do this manually via vectorization. Something along the lines of:
iris$Sepal.Width <- ifelse(iris$Sepal.Length > 5.0, NA, iris$Sepal.Width)
In this approach however, I would need to manually specify every column.
Question
I strongly suspect there is a clever way to tackle this via either purrr or dplyr. Nevertheless, I've gotten myself down a pmap / modify_at rabbit hole. Any suggestions towards elegance would be much appreciated.
Thanks!
library(data.table)
dt <- copy(iris)
setDT(dt)
dt[Sepal.Length > 5.0, (which(!names(dt) == "Sepal.Length")) := NA]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 NA NA NA NA
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 NA NA NA NA
# 147: 6.3 NA NA NA NA
# 148: 6.5 NA NA NA NA
# 149: 6.2 NA NA NA NA
# 150: 5.9 NA NA NA NA
An alternative would be to simply use this (only handy if you want to change all columns from the second one onward):
iris[iris$Sepal.Length > 5.0, 2:ncol(iris)] <- NA
# And the output for first six rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA <NA>
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 NA NA NA <NA>
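If the column to keep is not the first one, the same base R assignment can select columns by name instead of by position. A minimal sketch, working on a copy of the first six rows; the names df, keep and bad_rows are made up for illustration.

```r
df <- head(iris)

keep <- "Sepal.Length"              # column to leave untouched
bad_rows <- df$Sepal.Length > 5.0   # rows with the suspect measurement

# Blank out every other column in the flagged rows, selecting columns by name
df[bad_rows, names(df) != keep] <- NA
df
```

Species stays a factor here, so the missing entries print as <NA> rather than as integer codes.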
It sounds like this would work for you
my_clip <- function(x, z) ifelse(z>5, NA, x)
iris %>% mutate_at(vars(-Sepal.Length), my_clip, z=.$Sepal.Length)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA NA
# 2 4.9 3.0 1.4 0.2 1
# 3 4.7 3.2 1.3 0.2 1
# 4 4.6 3.1 1.5 0.2 1
# 5 5.0 3.6 1.4 0.2 1
# 6 5.4 NA NA NA NA
We use mutate_at to grab all the columns we want to transform. Since you can't easily reference other columns inside the function passed to mutate_at, we pass the threshold column in as a separate parameter using the .$ syntax.
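In the output above, Species comes back as integer codes because ifelse() does not keep factor attributes. A possible variant, assuming dplyr 1.0.0 or later (where across() is available), uses replace() so that each column keeps its type. This is a sketch rather than the answer's own method; the name clipped is made up.

```r
library(dplyr)

# For every column except Sepal.Length, set the value to NA
# in rows where Sepal.Length exceeds 5
clipped <- head(iris) %>%
  mutate(across(-Sepal.Length, ~ replace(.x, Sepal.Length > 5, NA)))

clipped
class(clipped$Species)
```

Inside across(), the lambda can refer to Sepal.Length directly, so no extra argument or .$ syntax is needed.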
Since you asked for a purrr example, here goes. Although I prefer the data.table answer already proposed:
library(dplyr)  # needed for mutate()
library(purrr)
library(tidyr)
iris %>% nest(-Sepal.Length) %>%
  mutate(data = ifelse(Sepal.Length > 5.0,
                       map(data, function(x) x*NA), data)) %>%
  unnest
With magrittr you could do this:
library(magrittr)
iris %>% head %>% inset(.$Sepal.Length > 5,-1,NA)
or using base R's replacement function instead of magrittr's inset (same output, just an uglier function name :), and you still need magrittr or dplyr for the pipes):
iris %>% head %>% `[<-`(.$Sepal.Length > 5,-1,NA)
-1 is the index of the column you want to keep, negated.
result
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 NA NA NA <NA>
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 NA NA NA <NA>
I would like to calculate the distance by group. There are three classes in the data frame.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 versicolor
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 virginica
I have also calculated the row sum of each class
rsums = aggregate(iris$rsum, by=list(Class=iris$Species), FUN=sum)
and the outcome looks like
Class Centroid
1 setosa 1521.3
2 versicolor 2143.8
3 virginica 2571.0
So I need to subtract the sum of each group from each row value of the same group to get the absolute difference, as in the example below.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1-1521.3 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7-2143.8 3.2 1.3 0.2 versicolor
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4-2571.0 3.9 1.7 0.4 virginica
I think your question is a bit unclear. But does
iris$rsum <- rowSums(iris[,-5])
rsums <- aggregate(iris$rsum, by=list(Class=iris$Species), FUN=sum)
iris[,-(5:6)] <- iris[,-(5:6)] - rsums$x[iris$Species]
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species rsum
#1 -1516.2 -1517.8 -1519.9 -1521.1 setosa 30.6
#2 -1516.4 -1518.3 -1519.9 -1521.1 setosa 28.5
#3 -1516.6 -1518.1 -1520.0 -1521.1 setosa 28.2
#4 -1516.7 -1518.2 -1519.8 -1521.1 setosa 28.2
#5 -1516.3 -1517.7 -1519.9 -1521.1 setosa 30.6
#6 -1515.9 -1517.4 -1519.6 -1520.9 setosa 34.2
do what you want?
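This relies on iris$Species being a factor: indexing rsums$x with it picks the group total for each row via the factor's integer codes. R's recycling rules then apply that per-row total to every numeric column when subtracting.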
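The factor-indexing step can be checked in isolation. The sketch below works on a copy of iris and compares three ways of getting one group total per row; the names df, per_row_codes, per_row_match and per_row_ave are made up for illustration.

```r
df <- iris
df$rsum <- rowSums(df[, 1:4])
rsums <- aggregate(df$rsum, by = list(Class = df$Species), FUN = sum)

# Indexing with a factor uses its integer codes, so each row picks
# the total at the position of its species level
per_row_codes <- rsums$x[df$Species]

# Matching on the labels gives the same result without relying on
# rsums being ordered like the factor levels
per_row_match <- rsums$x[match(df$Species, rsums$Class)]

# ave() computes the group total per row directly
per_row_ave <- ave(df$rsum, df$Species, FUN = sum)

all.equal(per_row_codes, per_row_match)
all.equal(per_row_codes, per_row_ave)
```

The shortcut with integer codes works here because aggregate() returns the groups in factor-level order; the match() form is a little safer if the summary table might be reordered.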
How do I retain just one observation in my dataset when the dataset contains two columns with duplicate values? For example if this is my dataset below:
Rows 1 and 2 contain the same values in the Sepal.Length and Petal.Length columns: (5.1, 1.4) and (5.1, 1.4).
I want to remove the second row and just retain the first row.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 5.1 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 5.0 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Reproducible test data:
test12 <- head(iris)
test12[2,1] <- 5.1
Thanks in advance.
Use duplicated to compare those specific columns:
test12[!duplicated(test12[,c(1,3)]),]
## or referencing the column names themselves:
test12[!duplicated(test12[,c("Sepal.Length","Petal.Length")]),]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 5.0 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
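For readers already using dplyr, distinct() offers a similar result. This is a sketch on the reproducible test data from the question, not part of the original answer; the name deduped is made up.

```r
library(dplyr)

test12 <- head(iris)
test12[2, 1] <- 5.1

# Keep the first row for each (Sepal.Length, Petal.Length) pair;
# .keep_all = TRUE retains the other columns as well
deduped <- distinct(test12, Sepal.Length, Petal.Length, .keep_all = TRUE)
nrow(deduped)
```

Without .keep_all = TRUE, distinct() would return only the two columns named in the call.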
To keep only the first row:
row1 <- test12[1, ]
To drop the second row of your data frame:
dropRow <- test12[-2, ]