dplyr::select_ behavior with factor

I use dplyr::select to select columns of my dataset. I noticed an interesting behaviour of select_ when it is given a factor, and I would like to ask why it happens.
I have a 4x3 data frame and want to select columns "a" and "c":
x <- matrix(1:12, ncol = 3) %>%
  as.data.frame() %>%
  `colnames<-`(c("a", "b", "c"))
# works, output: "a" "c"
x %>% select_(.dots = c("a", "c")) %>% colnames()
# changing the search term to a factor returns the wrong columns: "a" "b"
x %>% select_(.dots = as.factor(c("a", "c"))) %>% colnames()
Could you kindly give a hint why this happens?

The problem is that a factor is stored internally as an integer vector, so the factor is coerced to integer, giving 1 and 2, and select_ returns the first two columns. In general, the select_ with .dots approach is outdated; we can use quosures or the scoped variants select_at(), select_if(), etc. instead:
x %>%
  select_at(vars(a, c))
Or
x %>%
  select(a, c)
Or
x %>%
  select(!!! quos(a, c))
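To see the coercion itself, here is a minimal base-R sketch (using the same column names as the example above):
f <- as.factor(c("a", "c"))
as.integer(f)
## [1] 1 2
# so select_(.dots = f) behaves like selecting columns 1 and 2, i.e. "a" and "b"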

Related

Remove Duplicate Values from List in R entirely [duplicate]

R's duplicated returns a vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are the same, duplicated will give me the vector
FALSE, FALSE, FALSE, TRUE, TRUE
But in this case I actually want to get
FALSE, FALSE, TRUE, TRUE, TRUE
that is, I want to know whether a row is duplicated by a row with a larger subscript too.
duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it. Just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.
Some late Edit:
You didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:
vec <- c("a", "b", "c","c","c")
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"
Edit: And an example for the case of a data frame:
df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
##   X1 X2
## 3  c  c
## 4  c  c
You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.
> vec <- c("a", "b", "c","c","c")
> vec[ duplicated(vec)]
[1] "c" "c"
> unique(vec[ duplicated(vec)])
[1] "c"
> vec %in% unique(vec[ duplicated(vec)])
[1] FALSE FALSE TRUE TRUE TRUE
Duplicated rows in a data frame can be obtained with dplyr by doing:
library(tidyverse)
df = bind_rows(iris, head(iris, 20)) # build some test data
df %>% group_by_all() %>% filter(n()>1) %>% ungroup()
To exclude certain columns from the grouping, group_by_at(vars(-var1, -var2)) can be used instead (see the sketch after the next example).
If the row indices, and not just the data, are needed, you can add them first, as in:
df %>% add_rownames %>% group_by_at(vars(-rowname)) %>% filter(n()>1) %>% pull(rowname)
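As a small sketch of the column-exclusion variant, assuming the iris-based test data built above and treating Species as the column to ignore:
df %>% group_by_at(vars(-Species)) %>% filter(n() > 1) %>% ungroup()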
I've had the same question, and if I'm not mistaken, this is also an answer.
vec[vec$col %in% vec[duplicated(vec$col), ]$col, ]  # vec is a data frame with a column named col
I don't know which one is faster, though; the dataset I'm currently using isn't big enough to produce meaningful timing differences.
Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():
allDuplicated <- function(vec) {
  # duplicated by an element with a smaller subscript
  front <- duplicated(vec)
  # duplicated by an element with a larger subscript
  back <- duplicated(vec, fromLast = TRUE)
  # TRUE if either applies
  all_dup <- front + back > 0
  return(all_dup)
}
Using the same example:
vec <- c("a", "b", "c","c","c")
allDuplicated(vec)
[1] FALSE FALSE TRUE TRUE TRUE
I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:
df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = case_when(length(Column1) > 1 ~ "Yes",
                                TRUE ~ "No")) %>%
  ungroup()
The code groups the rows by specific columns. If a group contains more than one row, all rows in that group are marked as duplicated. Once that is done, you can use the Duplicated column for filtering, etc.
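A slightly more compact variant of the same idea (a sketch, assuming the same grouping columns; n() returns the current group size):
df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = if_else(n() > 1, "Yes", "No")) %>%
  ungroup()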
This is how vctrs::vec_duplicate_detect() works
# on a vector
vctrs::vec_duplicate_detect(c(1, 2, 1))
#> [1] TRUE FALSE TRUE
# on a data frame
vctrs::vec_duplicate_detect(mtcars[c(1, 2, 1),])
#> [1] TRUE FALSE TRUE
Created on 2022-07-19 by the reprex package (v2.0.1)
If you are interested in which rows are duplicated for certain columns you can use a plyr approach:
library(plyr)
ddply(df, .(col1, col2), function(df) if (nrow(df) > 1) df else c())
Adding a count variable with dplyr:
df %>% add_count(col1, col2) %>% filter(n > 1) # data frame
df %>% add_count(col1, col2) %>% select(n) > 1 # logical vector
For duplicate rows (considering all columns):
df %>% group_by_all %>% add_tally %>% ungroup %>% filter(n > 1)
df %>% group_by_all %>% add_tally %>% ungroup %>% select(n) > 1
The benefit of these approaches is that you can specify the number of duplicates to use as a cutoff.
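For instance, a sketch that keeps only rows whose combination occurs more than twice (assuming the same col1/col2 columns as above):
df %>% add_count(col1, col2) %>% filter(n > 2)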
This updates @Holger Brandl's answer to reflect recent versions of dplyr (e.g. 1.0.5), in which group_by_all() and group_by_at() have been superseded. The help doc suggests using across() instead.
Thus, to get all rows for which there is a duplicate you can do this:
iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()
To include the indices of such rows, add a 'rowid' column but exclude it from the grouping:
iris %>% rowid_to_column() %>% group_by(across(!rowid)) %>% filter(n() > 1) %>% ungroup()
Append %>% pull(rowid) after the above and you'll get a vector of the indices.
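Putting it together, a sketch that returns just the indices of the duplicated rows (rowid_to_column() comes from tibble, which is loaded with the tidyverse):
iris %>%
  rowid_to_column() %>%
  group_by(across(!rowid)) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  pull(rowid)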

R: Check if all values of one column match uniquely all values of another column

I have a data set with a lot of values. Most x values match a value in y uniquely; however, some y values are matched by multiple xs. Is there an easy way to find which values of y map to multiple xs?
mydata <- data.frame(x = c(letters,letters), y=c(LETTERS,LETTERS))
mydata$y[c(3,5)] <- "A"
mydata$y[c(10,15)] <- "Z"
mydata %>% foo
[1] "A" "Z"
I apologize if I am missing some obvious command here.
Using dplyr, you can do:
library(dplyr)
mydata <- data.frame(x = letters, y=LETTERS, stringsAsFactors = FALSE)
mydata$y[c(3,5)] <- "A"
mydata$y[c(10,15)] <- "Z"
mydata %>% group_by(y) %>% filter(n() > 1)
If you want to extract just the y values, you can store the result in a data frame like this and take the unique y values:
df <- mydata %>% group_by(y) %>% filter(n() > 1)
unique(df$y)
An alternative way to get the same output is as follows; this returns a single-column data frame instead of a vector:
mydata %>% group_by(y) %>% filter(n() > 1) %>% select(y) %>% distinct()
Use data.table:
library(data.table)
setDT(mydata)
mydata[, list(n = length(unique(x))), by = y][n > 2, ]
#    y n
# 1: A 3
# 2: Z 3
If we need the corresponding unique values in 'x'
library(data.table)
setDT(mydata)[, if (.N > 2) toString(unique(.SD[[1L]])), y]
#    y      V1
# 1: A a, c, e
# 2: Z j, o, z
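For completeness, a base-R sketch of the same check (assuming the mydata example above; tapply counts the distinct x values per y):
n_x <- tapply(mydata$x, mydata$y, function(v) length(unique(v)))
names(n_x[n_x > 1])
## [1] "A" "Z"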
