In R, dplyr mutate referencing column names by string

mydf = data.frame(a = c(1,2,3,4), b = c(5,6,7,8), c = c(3,4,5,6))
var1 = 'a'
var2 = 'b'
mydf = mydf %>% mutate(newCol = var1 + var2)
In our code, var1 and var2 can refer to different columns in mydf, and we need to create newCol by summing the values in the columns whose names are stored in var1 and var2. I understand this can be done outside of dplyr, but I am wondering whether there is a solution that uses dplyr and %>% as above.

We can convert the strings to symbols and evaluate them with !!
library(dplyr)
mydf %>%
mutate(newCol = !! rlang::sym(var1) + !! rlang::sym(var2))
Another option is to subset the columns with the .data pronoun
mydf %>%
mutate(newCol = .data[[var1]] + .data[[var2]])
Or we may use rowSums (note that in dplyr >= 1.1.0, cur_data() is deprecated in favor of pick())
mydf %>%
mutate(newCol = rowSums(select(cur_data(), all_of(c(var1, var2)))))
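Since the linked duplicate asks about passing dynamic column names into a custom function, here is a minimal sketch of the .data approach wrapped in a function. The helper name add_sum_col and its out argument are hypothetical (they do not appear in the answers above), and the "{out}" := naming syntax assumes dplyr >= 1.0.0.

```r
library(dplyr)

# Hypothetical helper: adds the sum of two columns whose names are
# passed as strings. `out` is the (assumed) name of the new column.
add_sum_col <- function(df, col1, col2, out = "newCol") {
  df %>%
    mutate("{out}" := .data[[col1]] + .data[[col2]])
}

mydf <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(3, 4, 5, 6))
var1 <- "a"
var2 <- "b"

res <- add_sum_col(mydf, var1, var2)
```

Using .data[[col]] here avoids building symbols by hand, and it fails with an error if the named column does not exist in the data frame.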

Related

Arrange data based on a user-defined variable order?

I have the following data.frame and would like to change the order of the rows in such a way that rows with variable == "C" come at the top followed by rows with "A" and then those with "B".
library(tidyverse)
set.seed(123)
D1 <- data.frame(Serial = 1:10, A = runif(10, 1, 5),
                 B = runif(10, 3, 6),
                 C = runif(10, 2, 5)) %>%
  pivot_longer(-Serial, names_to = "variables", values_to = "Value") %>%
  arrange(-desc(variables))
D1 %>%
mutate(variables = ordered(variables, c('C', 'A', 'B'))) %>%
arrange(variables)
Perhaps I did not get the question. If you want C then A then B, you could do:
D1 %>%
arrange(Serial, variables)
@Onyambu's answer is probably the most "tidyverse-ish" way to do it, but another is:
D1[order(match(D1$variables,c("C","A","B"))),]
or
D1 %>% slice(order(match(variables,c("C","A","B"))))
or
D1 %>% slice(variables %>% match(c("C","A","B")) %>% order())
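The match() idea above can also be passed straight to arrange(), which accepts arbitrary expressions. The following is a small sketch on a reduced, made-up data frame (not the D1 from the question); the name wanted is an assumption introduced for illustration.

```r
library(dplyr)

# Small illustrative data, not the original D1
D1_small <- data.frame(Serial = rep(1:2, each = 3),
                       variables = rep(c("A", "B", "C"), times = 2),
                       Value = 1:6,
                       stringsAsFactors = FALSE)

# Desired order of the values in `variables`
wanted <- c("C", "A", "B")

# match() returns each value's position in `wanted`; arrange() sorts on it
sorted <- D1_small %>%
  arrange(match(variables, wanted))
```

Unlike the ordered() approach, this leaves the variables column as a plain character vector.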

Applying functions in dplyr pipes

Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error "Column out.stats must be length 1 (a summary value), not 4". Why does this not work? How do you apply functions like this inside a pipe?
The following answers both the question and the OP's last comment on it, which asks for the row numbers of the outliers:
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This is done by left-joining the outliers with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
With dplyr >= 1.0.0, summarise() can also return more than one row per group (since dplyr 1.1.0 this use is deprecated in favor of reframe())
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
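As a sketch of two other ways to use a function like boxplot.stats() inside a pipe: wrapping the result in list() keeps summarise() at one row, and filtering with %in% keeps the outlier rows themselves. The row column added here is an assumption introduced to carry the original row numbers through the pipe.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

# Option 1: store the whole boxplot.stats() result in a list column,
# so summarise() still returns a single row
stats_b <- data %>%
  filter(group == "b") %>%
  summarise(out.stats = list(boxplot.stats(value)))

# Option 2: keep only the outlier rows, together with their
# row numbers in the original data frame
outliers_b <- data %>%
  mutate(row = row_number()) %>%
  filter(group == "b") %>%
  filter(value %in% boxplot.stats(value)$out)
```

The individual components can then be pulled out of the list column, for example with stats_b$out.stats[[1]]$out.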

Add multiple columns with mutate using column-based conditions, without using explicit column name + POSIX

I have a dataframe of data: 1 column is POSIX, the rest is data.
I need to remove selectively some data from a group of columns and add these "new" columns to the original dataframe.
I can "easily" do it in base R (I am an old-style user). I'd like to do it more compactly with mutate_at or another function, but I am running into several issues.
A solution homemade with base R could be
df <- data.frame("date" = seq.POSIXt(as.POSIXct(format(Sys.time(),"%F %T"),tz="UTC"),length.out=20,by="min"), "a.1" = rnorm(20,0,3), "a.2" = rnorm(20,1,2), "b.1"= rnorm(20,1,4), "b.2"= rnorm(20,3,4))
df1 <- lapply(df[,grep("^a",names(df))], function(x) replace(x, which(x > 0 & x < 0.2), NA))
df1 <- data.frame(matrix(unlist(df1), nrow = nrow(df), byrow = F)) ## convert to data.frame
names(df1) <- grep("^a",names(df),value=T) ## rename columns
df1 <- cbind.data.frame("date"=df$date, df1) ## add date
Can anyone help me set up something that works with dplyr + transmute?
So far I have come up with something like:
df %>%
  select(starts_with("a.")) %>%
  transmute(
    case_when(
      .>0.2 ~ NA,
    )
  ) %>%
  cbind.data.frame(df)
But I am quite stuck, since I can't combine transmute with case_when: all the examples I found use the column names explicitly in case_when, but I can't, since I won't know the column names in advance. I will only know the prefix of the columns that I need to transmute.
Thanks,
Alex
We can use transmute_at if the intention is to return only the columns specified in vars
library(dplyr)
df %>%
transmute_at(vars(starts_with('a')), ~ case_when(. > 0.2~ NA_real_, TRUE~ .)) %>%
bind_cols(df %>% select(date), .)
If we need all the columns to return, but only change the columns of interest in vars, then we need mutate_at instead of transmute_at
df %>%
mutate_at(vars(starts_with('a')), ~ case_when(. > 0.2~ NA_real_, TRUE~ .)) %>%
select(date, starts_with('a')) # only need if we are selecting a subset of columns
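In current dplyr the *_at verbs are superseded by across(), so the same idea might be sketched as below. This assumes dplyr >= 1.0.0, keeps the > 0.2 condition used in the answer above, and uses a hypothetical "_clean" suffix so that the new columns are added alongside the originals; the fixed start date is also made up for reproducibility.

```r
library(dplyr)

set.seed(1)
df <- data.frame(date = seq(as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
                            by = "min", length.out = 20),
                 a.1 = rnorm(20, 0, 3), a.2 = rnorm(20, 1, 2),
                 b.1 = rnorm(20, 1, 4), b.2 = rnorm(20, 3, 4))

# For every column starting with "a.", set values above 0.2 to NA and
# store the result in a new column named <original>_clean
res <- df %>%
  mutate(across(starts_with("a."),
                ~ replace(.x, .x > 0.2, NA),
                .names = "{.col}_clean"))
```

Dropping the .names argument would overwrite the a.* columns in place, which corresponds to the mutate_at version.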

Using mutate_at() on a nested dataframe column to generate multiple unnested columns

I'm experimenting with dplyr, tidyr and purrr. I have data like this:
library(tidyverse)
set.seed(123)
df <- data_frame(X1 = rep(LETTERS[1:4], 6),
X2 = sort(rep(1:6, 4)),
ref = sample(1:50, 24),
sampl1 = sample(1:50, 24),
var2 = sample(1:50, 24),
meas3 = sample(1:50, 24))
Now dplyr is awesome because I can do things like mutate_at() to manipulate multiple columns at once. e.g:
df <- df %>%
mutate_at(vars(-one_of(c("X1", "X2", "ref"))), funs(first = . - ref)) %>%
mutate_at(vars(contains("first")), funs(second = . *2 ))
and tidyr allows me nest subsets of the data as sub-tables in a single column:
df <- df %>% nest(-X1)
and thanks to purrr I can summarize these sub-tables while retaining the original data in the nested column:
df %>% mutate(mean = map_dbl(data, ~ mean(.x$meas3_first_second)))
How can I use purrr and mutate_at() to generate multiple summary columns (take the means of different (but not all) columns in each nested sub-table)?
In this example I'd like to take the mean of every column with the word "second" in it. I had hoped that this might produce a new nested column which I could then unnest(), but it does not work.
df %>% mutate(mean = map(data, ~ mutate_at(vars(contains("second")),
funs(mean_comp_exp = mean(.)))))
How can I achieve this?
The comment by @aosmith was correct and helpful. In addition, I realised I needed to use summarise_at() and not mutate_at(), like so:
df %>%
mutate(mean = map(data, ~ summarise_at(.x, vars(contains("second")),
funs(mean_comp_exp = mean(.) )))) %>%
unnest(mean)
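Because data_frame(), funs() and nest(-X1) are deprecated in recent tidyverse releases, here is a sketch of the same pipeline using tibble(), across() and nest(data = -X1). It assumes dplyr >= 1.0.0 and tidyr >= 1.0.0; the output column name means is an assumption chosen to avoid shadowing the mean() function.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)
df <- tibble(X1 = rep(LETTERS[1:4], 6),
             X2 = sort(rep(1:6, 4)),
             ref = sample(1:50, 24),
             sampl1 = sample(1:50, 24),
             var2 = sample(1:50, 24),
             meas3 = sample(1:50, 24))

# Derived columns: <col>_first, then <col>_first_second
df2 <- df %>%
  mutate(across(-c(X1, X2, ref), ~ .x - ref, .names = "{.col}_first")) %>%
  mutate(across(contains("first"), ~ .x * 2, .names = "{.col}_second"))

# Nest by X1, summarise every "second" column per sub-table, then unnest
res <- df2 %>%
  nest(data = -X1) %>%
  mutate(means = map(data, function(d) {
    summarise(d, across(contains("second"), ~ mean(.x),
                        .names = "{.col}_mean_comp_exp"))
  })) %>%
  unnest(means)
```

The nested data column is retained, so the original sub-tables remain available next to the per-group means.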

Collapsing columns based on differences between groups using dplyr

I want to collapse multiple columns across groups such that the remaining summary statistic is the difference between the column values for each group. I have two methods but I have a feeling that there is a better way I should be doing this.
Example data
library(dplyr)
library(tidyr)
test <- data.frame(year = rep(2010:2011, each = 2),
id = c("A","B"),
val = 1:4,
val2 = 2:5,
stringsAsFactors = F)
Using summarize_each
test %>%
group_by(year) %>%
summarize_each(funs(.[id == "B"] - .[id == "A"]), val, val2)
Using tidyr
test %>%
gather(key,val,val:val2) %>%
spread(id,val) %>%
mutate(B.less.A = B - A) %>%
select(-c(A,B)) %>%
spread(key,B.less.A)
The summarize_each way seems relatively simple, but I feel like there should be a way to do this by grouping on id somehow. Also, is there a way to ignore NA values in the columns?
We can use data.table
library(data.table)
setDT(test)[, lapply(.SD, diff), by = year, .SDcols = val:val2]
# year val val2
#1: 2010 1 1
#2: 2011 1 1
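For readers who want to stay in dplyr, summarize_each() and funs() are deprecated, and the first approach from the question can be sketched with across() instead (assuming dplyr >= 1.0.0). This does not address the NA part of the question.

```r
library(dplyr)

test <- data.frame(year = rep(2010:2011, each = 2),
                   id = c("A", "B"),
                   val = 1:4,
                   val2 = 2:5,
                   stringsAsFactors = FALSE)

# Within each year, subtract the "A" value from the "B" value
# for every selected column
res <- test %>%
  group_by(year) %>%
  summarise(across(c(val, val2), ~ .x[id == "B"] - .x[id == "A"]))
```

Like the data.table answer, this assumes exactly one "A" row and one "B" row per year.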
