use dplyr to get values of a column [duplicate] - r

This question already has answers here:
Extract a dplyr tbl column as a vector
(8 answers)
Closed 7 years ago.
I'd like to have dplyr return a character vector instead of a data frame. Is there an easy way to do this?
#example data frame
df <- data.frame( x=c('a','b','c','d','e','f','g','h'),
y=c('a','a','b','b','c','c','d','d'),
z=c('a','a','a','a','a','a','d','d'),
stringsAsFactors = FALSE)
#desired output
unique(df$z)
[1] "a" "d"
#dplyr's output
df %>%
select(z) %>%
unique()
z
1 a
7 d

Try
library(dplyr)
df %>%
select(z) %>%
unique() %>%
.$z
#[1] "a" "d"
Or using magrittr
library(magrittr)
df %>%
select(z) %>%
unique() %>%
use_series(z)
#[1] "a" "d"
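In dplyr 0.7.0 and later, pull() extracts a column as a vector directly, so the `.$z` step is not needed. A minimal sketch, assuming the same example data frame as in the question; the name `unique_z` is only for illustration:

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(x = c('a','b','c','d','e','f','g','h'),
                 y = c('a','a','b','b','c','c','d','d'),
                 z = c('a','a','a','a','a','a','d','d'),
                 stringsAsFactors = FALSE)

# distinct() keeps one row per unique value of z;
# pull() then returns that column as a plain vector
unique_z <- df %>%
  distinct(z) %>%
  pull(z)

unique_z
```

The result is a character vector rather than a one-column data frame, so it can be passed straight to functions that expect a vector.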

Related

Find the factor level with the highest frequency in a column/variable [duplicate]

This question already has answers here:
How to find the statistical mode?
(35 answers)
Closed 1 year ago.
I've run this code
var <- c("A","A","A","A","B","B","B","B","B","B","C","C","C")
table(var)
> table(var)
var
A B C
4 6 3
The maximum frequency is 6, for factor "B".
Is there a function that just returns the name of the factor level with the highest frequency, "B"?
Any help greatly appreciated. Thanks
A possible solution:
library(tidyverse)
var <- c("A","A","A","A","B","B","B","B","B","B","C","C","C")
table(var) %>% which.max %>% names
#> [1] "B"
In base R:
names(which.max(table(var)))
Using tidyverse:
library(tidyverse)
var <- c("A","A","A","A","B","B","B","B","B","B","C","C","C")
df <- tibble(var = var)
df %>%
count(var,sort = TRUE) %>%
slice(1) %>%
pull(var)
#> [1] "B"
Created on 2021-11-17 by the reprex package (v2.0.1)
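One caveat with the which.max() approach: which.max() returns only the first maximum, so a tie is silently reduced to one level. A small base R sketch that returns every level sharing the top count; the vector `tied` is made-up data for illustration:

```r
# Made-up data in which "A" and "B" tie for the highest count
tied <- c("A", "A", "B", "B", "C")

counts <- table(tied)

# which.max() reports only the first maximum
first_mode <- names(which.max(counts))

# Comparing every count against max() keeps all levels that reach the top count
all_modes <- names(counts)[as.vector(counts) == max(counts)]

first_mode
all_modes
```

Whether a single value or all tied values is wanted depends on the use case, so this is one option rather than a replacement for the answers above.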

arrange data based on user defined variables order? [duplicate]

This question already has answers here:
Order data frame rows according to vector with specific order
(6 answers)
Closed 1 year ago.
I have the following data.frame and would like to change the order of the rows in such a way that rows with variable == "C" come at the top followed by rows with "A" and then those with "B".
library(tidyverse)
set.seed(123)
D1 <- data.frame(Serial = 1:10, A= runif(10,1,5),
B = runif(10,3,6),
C = runif(10,2,5)) %>%
pivot_longer(-Serial, names_to = "variables", values_to = "Value" ) %>%
arrange(-desc(variables))
D1 %>%
mutate(variables = ordered(variables, c('C', 'A', 'B'))) %>%
arrange(variables)
Perhaps I did not get the question. If you want C then A then B, you could do:
D1 %>%
arrange(Serial, variables)
@Onyambu's answer is probably the most "tidyverse-ish" way to do it, but another is:
D1[order(match(D1$variables,c("C","A","B"))),]
or
D1 %>% slice(order(match(variables,c("C","A","B"))))
or
D1 %>% slice(variables %>% match(c("C","A","B")) %>% order())
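To see why the match() variants work: match() gives each row the position of its label in the desired order, and order() then sorts the rows by that position. A small base R sketch on a made-up data frame (`toy`, `wanted`, `pos` and `sorted` are illustrative names, not from the question):

```r
# Made-up long-format data
toy <- data.frame(variables = c("A", "B", "C", "A", "B", "C"),
                  Value = 1:6,
                  stringsAsFactors = FALSE)

wanted <- c("C", "A", "B")

# Position of each row's label within the desired order: A -> 2, B -> 3, C -> 1
pos <- match(toy$variables, wanted)

# Ties are left in their original order, so rows with the same label
# keep their relative order
sorted <- toy[order(pos), ]

sorted$variables
```

Labels that do not appear in `wanted` get a position of NA, and order() places them last by default.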

Rename all columns with characters [duplicate]

This question already has answers here:
Add a prefix to column names
(4 answers)
Closed 3 years ago.
I need to rename all columns in my data.frame. Right now, they are numbered 1-150 (without the X) but I would like to add "id" before each number.
Right now:
c = data.frame(1, 2)
names(c)[1] <- "1"
names(c)[2] <- "2"
What I want: column names of id1, id2, and so on.
How can I do this?
You can use dplyr::rename_all()
library(dplyr)
iris %>%
rename_all(~ paste0("id_", .x)) %>%
names()
or with base R
setNames(
iris,
nm = paste0(
"id_", names(iris)
)
) %>% names()
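Applied to the data from the question, where the columns are literally named "1" and "2" and the desired prefix is "id" without an underscore, a sketch could look like the following. The object name `dat` is an assumption (the question uses `c`, which masks the base function), and rename_with() assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Rebuild the question's data frame under a name that does not mask base::c()
dat <- data.frame(1, 2)
names(dat) <- c("1", "2")

# Base R: prepend "id" to every column name
base_names <- paste0("id", names(dat))

# dplyr: rename_with() applies a function to the column names
dplyr_names <- dat %>%
  rename_with(~ paste0("id", .x)) %>%
  names()

base_names
dplyr_names
```

In base R, the new names can be assigned back with `names(dat) <- base_names`.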

R how to extract the data and its duplicate in a data.frame? [duplicate]

R's duplicated returns a vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are the same, duplicated will give me the vector
FALSE, FALSE, FALSE, TRUE, TRUE
But in this case I actually want to get
FALSE, FALSE, TRUE, TRUE, TRUE
that is, I want to know whether a row is duplicated by a row with a larger subscript too.
duplicated has a fromLast argument. The "Examples" section of ?duplicated shows you how to use it. Just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.
Some late Edit:
You didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:
vec <- c("a", "b", "c","c","c")
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"
Edit: And an example for the case of a data frame:
df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
## X1 X2
## 3 c c
## 4 c c
You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.
> vec <- c("a", "b", "c","c","c")
> vec[ duplicated(vec)]
[1] "c" "c"
> unique(vec[ duplicated(vec)])
[1] "c"
> vec %in% unique(vec[ duplicated(vec)])
[1] FALSE FALSE TRUE TRUE TRUE
Duplicated rows in a dataframe could be obtained with dplyr by doing
library(tidyverse)
df = bind_rows(iris, head(iris, 20)) # build some test data
df %>% group_by_all() %>% filter(n()>1) %>% ungroup()
To exclude certain columns group_by_at(vars(-var1, -var2)) could be used instead to group the data.
If the row indices, and not just the data, are actually needed, you could add them first, as in:
df %>% add_rownames %>% group_by_at(vars(-rowname)) %>% filter(n()>1) %>% pull(rowname)
I've had the same question, and if I'm not mistaken, this is also an answer.
vec[col %in% vec[duplicated(vec$col),]$col]
I don't know which one is faster, though; the dataset I'm currently using isn't big enough for tests to show significant timing differences.
Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():
allDuplicated <- function(vec){
front <- duplicated(vec)
back <- duplicated(vec, fromLast = TRUE)
all_dup <- front + back > 0
return(all_dup)
}
Using the same example:
vec <- c("a", "b", "c","c","c")
allDuplicated(vec)
[1] FALSE FALSE TRUE TRUE TRUE
I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:
df <- df %>%
group_by(Column1, Column2, Column3) %>%
mutate(Duplicated = case_when(length(Column1)>1 ~ "Yes",
TRUE ~ "No")) %>%
ungroup()
The code groups the rows by specific columns. If a group contains more than one row, the code marks all of the rows in that group as duplicated. Once that is done, you can use the Duplicated column for filtering etc.
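A self-contained sketch of the same idea on made-up data. The data frame `toy`, its column names and its values are hypothetical, and n() is used in place of length(Column1) to count the rows in each group:

```r
library(dplyr)

# Hypothetical data: rows 1 and 3 agree on Column1 and Column2
toy <- data.frame(Column1 = c("a", "b", "a", "c"),
                  Column2 = c(1, 2, 1, 3),
                  Other   = c(10, 20, 30, 40))

flagged <- toy %>%
  group_by(Column1, Column2) %>%
  # n() is the number of rows in the current group
  mutate(Duplicated = if_else(n() > 1, "Yes", "No")) %>%
  ungroup()

flagged$Duplicated
```

Because mutate() keeps the original row order, the flag lines up with the input rows, and the `Other` column is ignored when deciding what counts as a duplicate.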
This is how vctrs::vec_duplicate_detect() works
# on a vector
vctrs::vec_duplicate_detect(c(1, 2, 1))
#> [1] TRUE FALSE TRUE
# on a data frame
vctrs::vec_duplicate_detect(mtcars[c(1, 2, 1),])
#> [1] TRUE FALSE TRUE
Created on 2022-07-19 by the reprex package (v2.0.1)
If you are interested in which rows are duplicated for certain columns you can use a plyr approach:
ddply(df, .(col1, col2), function(df) if(nrow(df) > 1) df else c())
Adding a count variable with dplyr:
df %>% add_count(col1, col2) %>% filter(n > 1) # data frame
df %>% add_count(col1, col2) %>% pull(n) > 1 # logical vector
For duplicate rows (considering all columns):
df %>% group_by_all %>% add_tally %>% ungroup %>% filter(n > 1)
df %>% group_by_all %>% add_tally %>% ungroup %>% pull(n) > 1
The benefit of these approaches is that you can choose the number of duplicates to use as a cutoff.
This updates @Holger Brandl's answer to reflect recent versions of dplyr (e.g. 1.0.5), in which group_by_all() and group_by_at() have been superseded. The help doc suggests using across() instead.
Thus, to get all rows for which there is a duplicate you can do this:
iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()
To include the indices of such rows, add a 'rowid' column but exclude it from the grouping:
iris %>% rowid_to_column() %>% group_by(across(!rowid)) %>% filter(n() > 1) %>% ungroup()
Append %>% pull(rowid) after the above and you'll get a vector of the indices.
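A small reproducible sketch of the approach above on made-up data. It spells the grouping out as across(everything()) because, as far as I know, newer dplyr releases warn about an empty across() call; treat that detail as an assumption to check against your installed version. The data frame `toy` is hypothetical:

```r
library(dplyr)
library(tibble)  # for rowid_to_column()

# Made-up data: rows 1, 3 and 4 are identical
toy <- data.frame(a = c(1, 2, 1, 1),
                  b = c("x", "y", "x", "x"))

# All rows that have at least one duplicate
dups <- toy %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  ungroup()

# Indices of those rows: add a rowid column but keep it out of the grouping
dup_ids <- toy %>%
  rowid_to_column() %>%
  group_by(across(!rowid)) %>%
  filter(n() > 1) %>%
  pull(rowid)

dup_ids
```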

dplyr::select one column and output as vector [duplicate]

This question already has answers here:
Extract a dplyr tbl column as a vector
(8 answers)
Closed 8 years ago.
dplyr::select results in a data.frame, is there a way to make it return a vector if the result is one column?
Currently, I have to do an extra step (res <- res$y) to convert it from a data.frame to a vector; see this example:
#dummy data
df <- data.frame(x = 1:10, y = LETTERS[1:10], stringsAsFactors = FALSE)
#dplyr filter and select results in data.frame
res <- df %>% filter(x > 5) %>% select(y)
class(res)
#[1] "data.frame"
#desired result is a character vector
res <- res$y
class(res)
#[1] "character"
Something as below:
res <- df %>% filter(x > 5) %>% select(y) %>% as.character
res
# This gives strange output
[1] "c(\"F\", \"G\", \"H\", \"I\", \"J\")"
# I need:
# [1] "F" "G" "H" "I" "J"
The best way to do it (IMO):
library(dplyr)
df <- tibble(x = 1:10, y = LETTERS[1:10]) # tibble() replaces the deprecated data_frame()
df %>%
filter(x > 5) %>%
.$y
Since dplyr 0.7.0, you can use pull():
df %>% filter(x > 5) %>% pull(y)
Something like this?
> res <- df %>% filter(x>5) %>% select(y) %>% sapply(as.character) %>% as.vector
> res
[1] "F" "G" "H" "I" "J"
> class(res)
[1] "character"
You could also try
res <- df %>%
filter(x>5) %>%
select(y) %>%
as.matrix() %>%
c()
#[1] "F" "G" "H" "I" "J"
class(res)
#[1] "character"
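The strange output in the question arises because a data frame is a list of columns, and as.character() applied to a list deparses each element into a single string. A base R sketch of that behaviour, alongside plain subsetting that yields the desired vector (the names `res_df`, `deparsed` and `res_vec` are illustrative):

```r
df <- data.frame(x = 1:10, y = LETTERS[1:10], stringsAsFactors = FALSE)

# A one-column data frame is a list with one element
res_df <- df[df$x > 5, "y", drop = FALSE]

# as.character() collapses the whole column into one deparsed string
deparsed <- as.character(res_df)

# Extracting the column with [[ ]] gives the character vector itself
res_vec <- res_df[["y"]]

length(deparsed)
res_vec
```

This is also what pull() and `.$y` do in the answers above: they extract the column from the list instead of converting the list as a whole.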
