mutate(across) to generate multiple new columns in tidyverse - r

I often have to perform the same calculation on a series of variables/columns that share a suffix (say, _a to _i) and save the results in new variables/columns. The calculation is the same each time; only the variables involved change, and those can be identified by the same suffix (_a to _i). So what I basically want to achieve is the following:
newvar_a = (oldvar1_a + oldvar2_a) - z
...
newvar_i = (oldvar1_i + oldvar2_i) - z
This is the furthest I got:
mutate(across(c(oldvar1_a:oldvar1_i), ~ . - z, .names = "{col}_new"))
Thus, I'm able to "loop" over oldvar1_a to oldvar1_i, subtract z from them, and save the results in new columns named oldvar1_a_new to oldvar1_i_new. However, I'm not able to include oldvar2_a to oldvar2_i in the calculations, as R won't loop over them. (Additionally, I'd still need to rename the new columns.)
I found a way to achieve the result using a for-loop. However, this definitely doesn't look like the most efficient and straightforward way to do it:
for (i in letters[1:9]) {
  oldvar1_x <- paste0("oldvar1_", i)
  oldvar2_x <- paste0("oldvar2_", i)
  newvar_x <- paste0("newvar_", i)
  df <- df %>%
    mutate(!!sym(newvar_x) := (!!sym(oldvar1_x) + !!sym(oldvar2_x)) - z)
}
Thus, I'd like to know whether/how it is possible to make mutate(across) loop over multiple columns that can be identified by suffixes, as in the example above.

In this case, you can use cur_data() and cur_column() to take advantage of the fact that the columns you want to sum share a suffix and differ only in the number (1 vs. 2), which you can swap out.
library(dplyr)
df <- data.frame(
  oldvar1_a = 1:3,
  oldvar2_a = 4:6,
  oldvar1_i = 7:9,
  oldvar2_i = 10:12,
  z = c(1, 10, 20)
)
mutate(
  df,
  across(
    starts_with("oldvar1"),
    ~ (.x + cur_data()[gsub("1", "2", cur_column())]) - z,
    .names = "{col}_new"
  )
)
#>   oldvar1_a oldvar2_a oldvar1_i oldvar2_i  z oldvar2_a oldvar2_i
#> 1         1         4         7        10  1         4        16
#> 2         2         5         8        11 10        -3         9
#> 3         3         6         9        12 20       -11         1
If you want to use this inside case_when(), make sure to index with [[ so that you get a vector back rather than a one-column data frame.
df <- data.frame(
  oldvar1_a = 1:3,
  oldvar2_a = 4:6,
  oldvar1_i = 7:9,
  oldvar2_i = 10:12,
  z = c(1, 2, 0)
)
mutate(
  df,
  across(
    starts_with("oldvar1"),
    ~ case_when(
      z == 1 ~ .x,
      z == 2 ~ cur_data()[[gsub("1", "2", cur_column())]],
      TRUE ~ NA_integer_
    ),
    .names = "{col}_new"
  )
)
#>   oldvar1_a oldvar2_a oldvar1_i oldvar2_i z oldvar1_a_new oldvar1_i_new
#> 1         1         4         7        10 1             1             7
#> 2         2         5         8        11 2             5            11
#> 3         3         6         9        12 0            NA            NA
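Note that cur_data() was deprecated in dplyr 1.1.0. Here is a minimal sketch of the same idea using get() instead, assuming the same df as above; because get() returns a plain vector rather than a one-column data frame, the new columns also print under their {col}_new names:
mutate(
  df,
  across(
    starts_with("oldvar1"),
    # get() looks the paired "oldvar2" column up in mutate()'s data mask
    ~ (.x + get(gsub("1", "2", cur_column()))) - z,
    .names = "{col}_new"
  )
)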

There is a fairly straightforward way to do what I believe you are attempting to do.
# first, let's create data
library(dplyr)
df <- data.frame(var1_a = runif(10, min = 128, max = 131),
                 var2_a = runif(10, min = 128, max = 131),
                 var1_b = runif(10, min = 128, max = 131),
                 var2_b = runif(10, min = 128, max = 131),
                 var1_c = runif(10, min = 128, max = 131),
                 var2_c = runif(10, min = 128, max = 131))
# taking a wild guess at what your z is
z <- 4
# initialize a list
fnl <- list()
# iterate over each suffix, put the results in the list
for (i in letters[1:3]) {
  dc <- df %>% select(ends_with(i))
  res <- dc %>% mutate(a = rowSums(dc[1:ncol(dc)]) - z)
  fnl <- append(fnl, res)
}
# convert to a dataframe/tibble
final <- bind_cols(fnl)
I left the column names sloppy, assuming you have specific requirements there. You can convert this loop into a function and do the whole thing in a single step using purrr; a sketch follows.
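For instance, here is a sketch of that loop as a single purrr pipeline (the sum_ column names are my own choice, and it assumes the df and z defined above):
library(purrr)
final <- letters[1:3] %>%
  map(function(sfx) {
    dc <- select(df, ends_with(sfx))
    mutate(dc, !!paste0("sum_", sfx) := rowSums(dc) - z)
  }) %>%
  bind_cols()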

Related

Delete columns in an R loop

I have a dataframe where I want to replace the variables
age_1 with values of variable age1_corr_1 if age1_corr_1 is not NA
age_2 with values of variable age1_corr_2 if age1_corr_2 is not NA, ...,
age_n with values of variable age1_corr_n if age1_corr_n is not NA.
Then I'd like to delete the variables age1_corr_1, age1_corr_2, ..., age1_corr_n. I have figured out how to do the first part (change the values) in a loop, but couldn't figure out how to delete the variables afterwards. Any suggestions?
Sample data
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
The code that will change values of age_n based on age1_corr_n
for (i in 1:4) {
  cname1 <- paste0("age_", i)
  cname2 <- paste0("age1_corr_", i)
  y[, cname1] <- ifelse(!is.na(y[, cname2]), y[, cname2], y[, cname1])
}
The output I'd like to have is
  age_1 age_2 age_3 age_4
1     1     1     4     1
2     1     2     3     4
3     1    10     2     2
4     0     9     6     7
You have several options if there is a pattern to the columns you want to remove (or conversely, the ones you want to keep).
Here's the data you provided:
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
Here's a dplyr example of how to get only those columns that follow the pattern age_N, where N is 1, 2, 3, or 4:
library(dplyr)
x <- select(y, paste("age", 1:4, sep = "_"))
Alternatively, you could choose the pattern for the columns you DON'T want:
x <- select(y, -grep("_corr_", current_vars()))
This uses the following strategy:
* you can select for everything BUT a column or set of columns by adding a minus sign first.
* current_vars() is a helper function in dplyr that evaluates to all the variable names for the data (here, y)
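Note that current_vars() has since been retired from dplyr; in current versions the tidyselect helpers cover this directly, e.g. (same y):
x <- select(y, -contains("_corr_"))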
Do the real work with dplyr::coalesce() (description: "Given a set of vectors, coalesce() finds the first non-missing value at each position."). Then drop the columns with dplyr::select(), using a negative sign in front of the columns you don't need anymore.
library(magrittr)
y %>%
  dplyr::mutate(
    age1_corr_4 = as.numeric(age1_corr_4), # Delete this line if it's already a numeric/floating data type.
    age_1 = dplyr::coalesce(age1_corr_1, age_1),
    age_2 = dplyr::coalesce(age1_corr_2, age_2),
    age_3 = dplyr::coalesce(age1_corr_3, age_3),
    age_4 = dplyr::coalesce(age1_corr_4, age_4)
  ) %>%
  dplyr::select(
    -age1_corr_1, -age1_corr_2, -age1_corr_3, -age1_corr_4
  )
Produces
  age_1 age_2 age_3 age_4
1     1     1     4     1
2     1     2     3     4
3     1    10     2     2
4     0     9     6     7
Edit: I apologize, I focused on the coalesce part of the task and ignored the n part of the task.
Here are two other approaches that can handle an arbitrary number of columns. For this specific example dataset, first make sure that the 4th column is correctly represented as a float with y$age1_corr_4 <- as.numeric(y$age1_corr_4).
Like Dan Hall's response, one approach keeps the columns you want...
library(magrittr)
coalesce_corr1 <- function(index) {
  name_age  <- paste0("age_",       index)
  name_corr <- paste0("age1_corr_", index)
  y %>%
    dplyr::mutate(
      !!name_age := dplyr::coalesce(.data[[name_corr]], .data[[name_age]])
    ) %>%
    dplyr::select(!!name_age)
}
1:4 %>%
purrr::map(coalesce_corr1) %>%
dplyr::bind_cols()
...and the other drops the columns you don't want.
z <- y
coalesce_corr2 <- function(index) {
  name_age  <- paste0("age_",       index)
  name_corr <- paste0("age1_corr_", index)
  z <<- z %>%
    dplyr::mutate(
      !!name_age := dplyr::coalesce(.data[[!!name_corr]], .data[[!!name_age]])
    )
  z[[name_corr]] <<- NULL
}
1:4 %>%
purrr::walk(coalesce_corr2)
z
I wish this last one didn't require a global variable (that uses <<-), and for this reason, I actually recommend Dan's approaches, but I wanted to try out quosures for output variables.
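For completeness, here is a loop-free sketch using across() and cur_column() (dplyr >= 1.0), assuming the same y as above with age1_corr_4 already converted to numeric:
library(dplyr)
y %>%
  mutate(across(
    starts_with("age_"),
    # pair each age_n with its age1_corr_n and prefer the correction
    ~ coalesce(.data[[sub("age_", "age1_corr_", cur_column())]], .x)
  )) %>%
  select(-starts_with("age1_corr_"))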

Joint Occurrence of variables in R

I want to count individual and combine occurrence of variables (1 represents presence and 0 represents absence). This can be obtained by multiple uses of table function (See MWE below). Is it possible to use a more efficient approach to get the required output given below?
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
table(A)
A
 0  1
48 52

table(B)
B
 0  1
53 47

table(C)
C
 0  1
34 66

table(A, B)
   B
A    0  1
  0 25 23
  1 28 24

table(A, C)
   C
A    0  1
  0 12 36
  1 22 30

table(B, C)
   C
B    0  1
  0 21 32
  1 13 34

table(A, B, C)
, , C = 0

   B
A    0  1
  0  8  4
  1 13  9

, , C = 1

   B
A    0  1
  0 17 19
  1 15 15
Required Output
I am requiring something like the following:
A = 52
B = 47
C = 66
A + B = 24
A + C = 30
B + C = 34
A + B + C = 15
Expanding on Sumedh's answer, you can also do this dynamically without having to specify the filter every time. This is useful if you have more than just 3 columns to combine.
You can do something like this:
lapply(seq_len(ncol(df)), function(i) {
  # Generate all the combinations of i elements of the columns
  tmp_i = utils::combn(names(df), i)
  # Each column of tmp_i holds one combination
  apply(tmp_i, 2, function(x) {
    dynamic_formula = as.formula(paste("~", paste(x, "== 1", collapse = " & ")))
    df %>%
      filter_(.dots = dynamic_formula) %>%
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>%
    bind_rows()
}) %>%
  bind_rows()
This will:
1) generate all the combinations of the columns of df: first the combinations with one element (A, B, C), then those with two elements (AB, AC, BC), and so on. This is the outer lapply.
2) for every combination, build a dynamic formula. For AB, for instance, the formula is A == 1 & B == 1, exactly as Sumedh suggested. This is the dynamic_formula bit.
3) filter the dataframe with the dynamically generated formula and count the remaining rows.
4) bind everything together (the two bind_rows calls).
The output will be
  Count type
1    52    A
2    47    B
3    66    C
4    24   AB
5    30   AC
6    34   BC
7    15  ABC
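As an aside, filter_() and the .dots interface are deprecated in current dplyr; here is a sketch of the same computation with if_all() (dplyr >= 1.0.4):
library(purrr)
map_dfr(seq_along(df), function(i) {
  map_dfr(combn(names(df), i, simplify = FALSE), function(cols) {
    df %>%
      filter(if_all(all_of(cols), ~ .x == 1)) %>%
      summarize(Count = n()) %>%
      mutate(type = paste(cols, collapse = ""))
  })
})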
EDITED TO ADD: I see now that you don't want to get the exclusive counts (i.e. A and AB should both include all As).
I got more than a little nerd-sniped by this today, particularly as I wanted to solve it using base R with no packages. The below should do that.
There is a very easy (in principle) solution that simply uses xtabs(), which I've illustrated below. However, to generalize it for any potential number of dimensions, and then to apply it to a variety of combinations, actually was harder. I strove to avoid using the dreaded eval(parse()).
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
# Turn strings off
options(stringsAsFactors = FALSE)
# Obtain the n-way frequency table
# This table can be directly subset using []
# It is a little tricky to pass the arguments
# I'm trying to avoid eval(parse())
# But still give a solution that isn't bound to a specific size
xtab_freq <- xtabs(formula = formula(x = paste("~", paste(names(df), collapse = " + "))),
                   data = df)
# Demonstrating what I mean
# All A
sum(xtab_freq["1",,])
# [1] 52
# AC
sum(xtab_freq["1",,"1"])
# [1] 30
# Using lapply(), we pass names(df) to combn() with m values of 1, 2, and 3
# The output of combn() goes through list(), then is unlisted with recursive FALSE
# This gives us a list of vectors
# Each one being a combination in which we are interested
lst_combs <- unlist(lapply(X = 1:3, FUN = combn, x = names(df), list), recursive = FALSE)
# For nice output naming, I just paste the values together
names(lst_combs) <- sapply(X = lst_combs, FUN = paste, collapse = "")
# This is a function I put together
# Generalizes process of extracting values from a crosstab
# It does it in this fashion to avoid eval(parse())
uFunc_GetMargins <- function(crosstab, varvector, success) {
  # Obtain the dimname-names (the names within each dimension)
  # From that, get the regular dimnames
  xtab_dnn <- dimnames(crosstab)
  xtab_dn <- names(xtab_dnn)
  # Use match() to get a numeric vector for the margins
  # This can be used in margin.table()
  tgt_margins <- match(x = varvector, table = xtab_dn)
  # Obtain a margin table
  marginal <- margin.table(x = crosstab, margin = tgt_margins)
  # To extract the value, figure out which marginal cell contains
  # all variables of interest set to success
  # sapply() goes over all the elements of the dimname names
  # and finds the numeric index in that dimension where the name == success
  # We subset the resulting vector by tgt_margins
  # (to only get the cells in our marginal table)
  # Then, use prod() to multiply them together and get the location
  tgt_cell <- prod(sapply(X = xtab_dnn,
                          FUN = match,
                          x = success)[tgt_margins])
  # Return as a named list for ease of stacking
  return(list(count = marginal[tgt_cell]))
}
# Doing a call of mapply() lets us get the results
do.call(what = rbind.data.frame,
        args = mapply(FUN = uFunc_GetMargins,
                      varvector = lst_combs,
                      MoreArgs = list(crosstab = xtab_freq,
                                      success = "1"),
                      SIMPLIFY = FALSE,
                      USE.NAMES = TRUE))
#     count
# A      52
# B      47
# C      66
# AB     24
# AC     30
# BC     34
# ABC    15
I ditched the prior solution that used aggregate.
Using dplyr,
Occurrence of only A:
library(dplyr)
df %>% filter(A == 1) %>% summarise(Total = nrow(.))
Occurrence of A and B:
df %>% filter(A == 1, B == 1) %>% summarise(Total = nrow(.))
Occurrence of A, B, and C:
df %>% filter(A == 1, B == 1, C == 1) %>% summarise(Total = nrow(.))
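Since the columns are 0/1 indicators, the same counts also fall out of base sum() directly, for example:
sum(df$A)               # occurrences of A
sum(df$A & df$B)        # A and B together
sum(df$A & df$B & df$C) # all three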

How does dplyr filter work in R?

I want to filter only the rows that are less than 10 units away form the point (1,1). My dataframe has two columns, x and y.
This is what I have tried:
filter(df, dist( rbind(c(1,2), c(x,y)) ) < 10 )
But, this is not working. It always returns a 0 row result, although I know that it should return a couple of rows. How can I debug this? I would like to print every value passed to x and y in every iteration.
Per request, this is the output of dput(head(df)):
structure(list(x = c(1, 2, 3, 4, 5), y = c(1, 1, 1, 1, 1)),
          .Names = c("x", "y"), row.names = c(NA, 5L), class = "data.frame")
I would use your data but it is not affected by the filter. So I will create something random:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(20, size = 20, replace = TRUE),
                 y = sample(20, size = 20, replace = TRUE))
head(df)
# Source: local data frame [6 x 2]
#       x     y
#   <int> <int>
# 1    19    19
# 2    19     3
# 3     6    20
# 4    17    19
# 5    13     2
# 6    11    11
The problem is that dplyr::filter requires a logical vector (one TRUE/FALSE per row). If you manually check the return of dist(...), it is a "dist" object holding every pairwise distance (the lower triangle of an n-by-n matrix), so it is not clear how filter should presume to use that.
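You can confirm this quickly with the dput() data above:
d <- dist(df)
class(d)  # "dist" -- all pairwise distances, not one logical per row
length(d) # choose(nrow(df), 2) values, i.e. 10 here, not 5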
If your data really is just one point (c(1, 2)), then you need to manually calculate the distance between the known point and the variables of the data.frame, such as:
filter(df, sqrt( (x - 1)^2 + (y - 2)^2 ) < 10)
# Source: local data frame [2 x 2]
#       x     y
#   <int> <int>
# 1    10     1
# 2     3     5
(I'm assuming euclidean distance here.) If you have more dimensions and/or a slightly different distance equation, the generalization is straightforward; one possible sketch follows.
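For example (dist_from is my own helper name, still assuming euclidean distance from a fixed point):
# distance from a fixed point, for any number of coordinate columns
dist_from <- function(point, ...) {
  coords <- list(...) # one vector per dimension, e.g. x, y
  sqrt(Reduce(`+`, Map(function(v, p) (v - p)^2, coords, point)))
}
filter(df, dist_from(c(1, 2), x, y) < 10)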
If you are instead interested in the distance between all points in df (as your call to dist implies), then you may need to use which(..., arr.ind = TRUE) and some trickery. Or perhaps do an outer join between these (df) points and other points.

Sum observations from two columns, looping over many columns in R

I have searched high and low, but am stuck on how to approach this. I have two sets of columns that I want to sum, row by row, but which I want to loop over many columns. If I were to do this manually, I would want:
df1[1,1]+df2[1,1]
df1[2,1]+df2[2,1]
etc. I've found many helpful examples of how to do something like:
apply(df[,c("a","d")], 1, sum)
though I want to do this over lots of columns. Also, while it's not entirely relevant, I want to phrase my question as close to my reality as possible, so my example below includes NA's, since my actual data contains many missing values.
# make a data frame, df1, with three columns
a <- sample(1:100, 50, replace = T)
b <- sample(100:300, 50, replace = T)
c <- sample(2:50, 50, replace = T)
df1 <- cbind(a,b,c)
# make another data frame, df2, with three columns
x <- sample(1:100, 50, replace = T)
y <- sample(100:300, 50, replace = T)
z <- sample(2:50, 50, replace = T)
df2 <- cbind(x,y,z)
Here's a function to randomly throw a few NAs in, from http://www.r-bloggers.com/function-to-generate-a-random-data-set/
NAins <- NAinsert <- function(df, prop = .1) {
  n <- nrow(df)
  m <- ncol(df)
  num.to.na <- ceiling(prop * n * m)
  id <- sample(0:(m * n - 1), num.to.na, replace = FALSE)
  rows <- id %/% m + 1
  cols <- id %% m + 1
  sapply(seq(num.to.na), function(x) {
    df[rows[x], cols[x]] <<- NA
  })
  return(df)
}
Add the NAs to the frames
df1 <- NAins(df1, .2)
df2 <- NAins(df2, .14)
Then, I tried to seq along the columns in each data frame, and used apply setting the index to 1, meaning to sum each row entry. This doesn't work.
for (i in seq_along(df1)) {
  for (j in seq_along(df2)) {
    apply(c(df1[, i], col2[j]), 1, function(x) sum(x, na.rm = T))
  }
}
Thanks for any help!
You should be able to just replace NA with 0, and then add with "+":
replace(df1, is.na(df1), 0) + replace(df2, is.na(df2), 0)
#    X  Y  Z
# 1  7 19  6
# 2 11 12  1
# 3 16 14 11
# 4 13  7 13
# 5 10  2 11
Alternatively, if you have more than just two data.frames, you can collect them in a list and use Reduce:
Reduce("+", lapply(mget(c("df1", "df2", "df3")), function(x) replace(x, is.na(x), 0)))
Here's some sample data (and what I think is an easier way to create it):
set.seed(1) ## Set a seed so others can reproduce your sample data
dfmaker <- function() {
  setNames(
    data.frame(
      replicate(3, sample(c(NA, 1:10), 5, TRUE), FALSE)),
    c("X", "Y", "Z"))
}
df1 <- dfmaker()
df1
#   X  Y Z
# 1 2  9 2
# 2 4 10 1
# 3 6  7 7
# 4 9  6 4
# 5 2 NA 8
df2 <- dfmaker()
df2
#    X  Y  Z
# 1  5 10  4
# 2  7  2 NA
# 3 10  7  4
# 4  4  1  9
# 5  8  2  3
df3 <- dfmaker()
You can bind the data.frames into a 3-D array and sum across it with the apply function.
# install.packages("abind")
library(abind)
df <- abind(list(df1, df2), along = 3)
results <- apply(df, MARGIN = c(1, 2), FUN = function(x) sum(x, na.rm = TRUE))
results

Subtract a column in a dataframe from many columns in R

I have a dataframe. I'd like to subtract the 2nd column from all other columns. I can do it in a loop, but I'd like to do it in one call. Here's my working loop code:
df <- data.frame(x = 100:101, y = 2:3,z=3:4,a = -1:0,b=4:5)
for (i in 3:length(df)) {
  df[i] <- df[i] - df[2]
}
If you need to subtract the second column from columns 3:ncol(df):
df[3:ncol(df)] <- df[3:ncol(df)]-df[,2]
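An equivalent base R phrasing with sweep(), for comparison (same df):
df[3:ncol(df)] <- sweep(df[3:ncol(df)], 1, df[, 2], "-")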
Another solution uses the dplyr::mutate_at() function:
# install.packages("dplyr", dependencies = TRUE)
library(dplyr)
df <- data.frame(x = 100:101, y = 2:3, z = 3:4, a = -1:0, b = 4:5)
df %>%
mutate_at(vars(-matches("y"), -matches("x")), list(dif = ~ . - y))
#>     x y z  a b z_dif a_dif b_dif
#> 1 100 2 3 -1 4     1    -3     2
#> 2 101 3 4  0 5     1    -3     2
Created on 2019-11-05 by the reprex package (v0.3.0)
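For reference, mutate_at() has since been superseded by across(); a sketch of the equivalent call (dplyr >= 1.0, same df):
df %>%
  mutate(across(c(z, a, b), ~ .x - y, .names = "{.col}_dif"))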
This would also work; it returns the 9 columns with the second column subtracted from each.
df = data.frame(matrix(rnorm(100,0,1),nrow = 10))
df[,-2] - df[,2]
