How to apply function to colnames in tidyverse - r

Just like in title: is there any function that allows applying another function to column names of data frame? I mean something like forcats::fct_relabel that applies some function to factor labels.
To give an example, supose I have a data.frame as below:
X<-data.frame(
firababst = c(1,1,1),
secababond = c(2,2,2),
thiababrd = c(3,3,3)
)
X
firababst secababond thiababrd
1 1 2 3
2 1 2 3
3 1 2 3
Now I wish to get rid of abab from column names by applying stringr::str_remove. My workaround involves magrittr::set_colnames:
X %>%
set_colnames(colnames(.) %>% str_remove('abab'))
first second third
1 1 2 3
2 1 2 3
3 1 2 3
Can you suggest some more strightforward way? Ideally, something like:
X %>%
magic_foo(str_remove, 'abab')

You can do:
X %>%
rename_all(~ str_remove(., "abab"))
first second third
1 1 2 3
2 1 2 3
3 1 2 3

With base R, we can do
names(X) <- sub("abab", "", names(X))

Related

Rename dataframe column names by switching string patterns

I have following dataframe and I want to rename the column names to c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")
> dataf <- data.frame(
+ WBC_D7_MIN=1:4,WBC_D7_MAX=1:4,DBP_D3_MIN=1:4
+ )
> dataf
WBC_D7_MIN WBC_D7_MAX DBP_D3_MIN
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
> names(dataf)
[1] "WBC_D7_MIN" "WBC_D7_MAX" "DBP_D3_MIN"
Probably, the rename_with function in tidyverse can do it, But I cannot figure out how to do it.
You can use capture groups with sub to extract values in order -
names(dataf) <- sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', names(dataf))
Same regex can be used in rename_with -
library(dplyr)
dataf %>% rename_with(~ sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', .))
# WBC_MIN_D7 WBC_MAX_D7 DBP_MIN_D3
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
You can rename your dataf with your vector with names(yourDF) <- c("A","B",...,"Z"):
names(dataf) <- c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")

Adding an index column representing a repetition of a dataframe in R

I have a dataframe in R that I'd like to repeat several times, and I want to add in a new variable to index those repetitions. The best I've come up with is using mutate + rbind over and over, and I feel like there has to be an efficient dataframe method I could be using here.
Here's an example: df <- data.frame(x = 1:3, y = letters[1:3]) gives us the dataframe
x
y
1
a
2
b
3
c
I'd like to repeat that say 3 times, with an index that looks like this:
x
y
index
1
a
1
2
b
1
3
c
1
1
a
2
2
b
2
3
c
2
1
a
3
2
b
3
3
c
3
Using the rep function, I can get the first two columns, but not the index column. The best I've come up with so far (using dplyr) is:
df2 <-
df %>%
mutate(index = 1) %>%
rbind(df %>% mutate(index = 2)) %>%
rbind(df %>% mutate(index = 3))
This obviously doesn't work if I need to repeat my dataframe more than a handful of times. It feels like the kind of thing that should be easy to do using dataframe methods, but I haven't been able to find anything.
Grateful for any tips!
You can use this code for as many data frames as you would like. You just have to set the n argument:
replicate function takes 2 main arguments. We first specify the number of time we would like to reproduce our data set by n. Then we specify our data set as expr argument. The result would be a list whose elements are instances of our data set
After that we pass it along to imap function from purrr package to define the unique id for each of our data set. .x represents each element of our list (here a data frame) and .y is the position of that element which amounts to the number of instances we created. So for example we assign value 1 to the first id column of the first data set as .y is equal to 1 for that and so on.
library(dplyr)
library(purrr)
replicate(3, df, simplify = FALSE) %>%
imap_dfr(~ .x %>%
mutate(id = .y))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
In base R you can use the following code:
do.call(rbind,
mapply(function(x, z) {
x$id <- z
x
}, replicate(3, df, simplify = FALSE), 1:3, SIMPLIFY = FALSE))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
You can use rerun to repeat the dataframe n times and add an index column using bind_rows -
library(dplyr)
library(purrr)
n <- 3
df <- data.frame(x = 1:3, y = letters[1:3])
bind_rows(rerun(n, df), .id = 'index')
# index x y
#1 1 1 a
#2 1 2 b
#3 1 3 c
#4 2 1 a
#5 2 2 b
#6 2 3 c
#7 3 1 a
#8 3 2 b
#9 3 3 c
In base R, we can repeat the row index 3 times.
transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
One more way
n <- 3
map_dfr(seq_len(n), ~ df %>% mutate(index = .x))
x y index
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3

Apply a vector of filters based on a string (or vector of strings) in dplyr

R and the tidyverse have some extremely powerful but equally mysterious methods for turning strings into actionable expressions. I feel like one needs to be an expert to really understand how to use them.
NOTE: this question differs from this one in that I specifically ask about a vector (that is multiple) filter conditions. I demonstrate a solution for single filters that fails when I try multiple ways of extending it to multiple filters.
I want to do something along the lines of:
df = data.frame(A=1:10, B=1:10)
df %>% filter(A<3, B<5)
But where the filters are contained in either a string such as "A<3, B<5" or a character vector such as c("A<3", "B<5").
I can do
df %>% filter(eval(str2expression("A<3")))
# A B
# 1 1 1
# 2 2 2
But this does not work:
df %>% filter(eval(str2expression("A<3, B<5")))
Error in str2expression("A<3, B<5") : <text>:1:4: unexpected ','
1: A<3,
^
These don't work either:
> df %>% filter(!!c(str2expression("A<3"), str2expression("B<5")))
Error: Argument 2 filter condition does not evaluate to a logical vector
> df %>% filter(!!!c(str2expression("A<3"), str2expression("B<5")))
Error: Can't splice an object of type `expression` because it is not a vector
Run `rlang::last_error()` to see where the error occurred.
Evaluating a vector of expressions from str2expression for some reason only applies the last expression:
> df %>% filter(eval(c(str2expression("A<3"), str2expression("B<5"))))
# A B
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
Using a vector of evaluated expressions fails altogether:
> df %>% filter(!!!c(eval(str2expression("A<3")), eval(str2expression("B<5"))))
Error in eval(str2expression("A<3")) : object 'A' not found
I can do:
> df %>% filter(!!!c(expr(A<3), expr(B<5)))
# A B
# 1 1 1
# 2 2 2
and this tells me that expr(A<3) is NOT the same thing as str2expression("A<3")
But that isn't starting from strings.
What to do?
You could use parse_exprs from rlang
library(dplyr)
expr <- c("A<3", "B<5")
filter(df, !!!rlang::parse_exprs(expr))
# A B
#1 1 1
#2 2 2
Or you could combine the two expressions and then use it in eval
filter(df, eval(parse(text = paste0(expr, collapse = "&"))))
# A B
#1 1 1
#2 2 2
Learning from #Ronak Shah's answer, apparently, in dplyr I can use multiple conditions with a single & in filter instead of a comma. I don't understand this at all---it is not the same thing as an and logical:
> df %>% filter(A<3 & B<5)
A B
1 1 1
2 2 2
> df %>% filter(A<3 && B<5)
A B
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
Nevertheless, the following does work:
> df %>% filter(eval(str2expression("A<3 & B<5")))
A B
1 1 1
2 2 2
> df %>% filter(eval(str2expression("A<6 & B<5")))
A B
1 1 1
2 2 2
3 3 3
4 4 4

How to sort a vector and print Top X value when the value is flat?

R language : How to sort a vector and print Top X value when the value is flat ?
If I have a vector like
v <- c(1,2,3,3,4,5)
I want to print the TOP1~TOP3 values.
So I use:
sort(v)[1:3]
[1] 1 2 3
In this case,TOP3 have 2 value
what I want to print is:
[1] 1 2 3 3
and their index
One way to do it:
v[v %in% sort(v)[1:3]]
# [1] 1 2 3 3
# following up OP's comment, if you want ordered outcomes:
# sort(v[v %in% sort(v)[1:3]])
We can use top_n from dplyr
library(dplyr)
data.frame(v) %>% top_n(-3)
# v
#1 1
#2 2
#3 3
#4 3
this returns a dataframe, if you want a vector pull it
data.frame(v) %>% top_n(-3) %>% pull(v)
#[1] 1 2 3 3

How to integrate set of vector in multiple data.frame into one without duplication?

I have position index vector in data.frame objects, but in each data.frame object, the order of position index vector are very different. However, I want to integrate/ merge these data.frame object object in one common data.frame with very specific order and not allow to have duplication in it. Does anyone know any trick for doing this more easily? Can anyone propose possible approach how to accomplish this task?
data
v1 <- data.frame(
foo=c(1,2,3),
bar=c(1,2,2),
bleh=c(1,3,0))
v2 <- data.frame(
bar=c(1,2,3),
foo=c(1,2,0),
bleh=c(3,3,4))
v3 <- data.frame(
bleh=c(1,2,3,4),
foo=c(1,1,2,0),
bar=c(0,1,2,3))
initial output after integrating them:
initial_output <- data.frame(
foo=c(1,2,3,1,2,0,1,1,2,0),
bar=c(1,2,2,1,2,3,0,1,2,3),
bleh=c(1,3,0,3,3,4,1,2,3,4)
)
remove duplication
rmDuplicate_output <- data.frame(
foo=c(1,2,3,1,0,1,1),
bar=c(1,2,2,1,3,0,1),
bleh=c(1,3,0,3,4,1,2)
)
final desired output:
final_output <- data.frame(
foo=c(1,1,1,1,2,3,0),
bar=c(0,1,1,1,2,2,3),
bleh=c(1,1,2,3,3,0,4)
)
How can I get my final desired output easily? Is there any efficient way for doing this sort of manipulation for data.frame object? Thanks
You could also use use mget/ls combo in order to get your data frames programmatically (without typing individual names) and then use data.tables rbindlist and unique functions/method for great efficiency gain (see here and here)
library(data.table)
unique(rbindlist(mget(ls(pattern = "v\\d+")), use.names = TRUE))
# foo bar bleh
# 1: 1 1 1
# 2: 2 2 3
# 3: 3 2 0
# 4: 1 1 3
# 5: 0 3 4
# 6: 1 0 1
# 7: 1 1 2
As a side note, it usually better to keep multiple data.frames in a single list so you could have better control over them
We can use bind_rows from dplyr, remove the duplicates with distinct and arrange by 'bar'
library(dplyr)
bind_rows(v1, v2, v3) %>%
distinct %>%
arrange(bar)
# foo bar bleh
#1 1 0 1
#2 1 1 1
#3 1 1 3
#4 1 1 2
#5 2 2 3
#6 3 2 0
#7 0 3 4
Here is a solution:
# combine dataframes
df = rbind(v1, v2, v3)
# remove duplicated
df = df[! duplicated(df),]
# sort by 'bar' column
df[order(df$bar),]
foo bar bleh
7 1 0 1
1 1 1 1
4 1 1 3
8 1 1 2
2 2 2 3
3 3 2 0
6 0 3 4

Resources