dplyr mutate_at function applied to multiple columns - using dynamic column names - r

I have a data frame x.
x <- data.frame(a = c(10, 20, 30, 0), b = c(1, 2, 3, 0), c = c(1, 2, 3, 0), d = c(8, 16, 24, 0))
x
denominator_var <- "a"
numerator_vars <- c("b", "c", "d")
Using dplyr, I'm trying to add new columns (b_share, c_share, and d_share) such that each of them is equal to the corresponding column (b, c, and d) divided by a.
However, it is important to me to use NOT the original variable names but dynamic variable names.
My code below is not working. What's wrong?
x %>% mutate_at(vars(one_of(numerator_vars)),
funs(share = ifelse(!!(denominator_var) > 0, round(./!!(denominator_var) * 100, 2), 0)))
Thank you very much!

Unquoting the string directly with !! inserts the character value "a" itself, not the column a. Convert the string to a symbol with as.name before applying !!:
x %>% mutate_at(vars(one_of(numerator_vars)), funs(share =
ifelse((!!as.name(denominator_var)) > 0, round(. / (!!as.name(denominator_var)) * 100, 2), 0)))
#    a b c  d b_share c_share d_share
# 1 10 1 1  8      10      10      80
# 2 20 2 2 16      10      10      80
# 3 30 3 3 24      10      10      80
# 4  0 0 0  0       0       0       0

You can get the result you want by quoting your denominator beforehand:
denominator_var <- quo(a)
x %>% mutate_at(numerator_vars,
funs(share = ifelse(!!(denominator_var) > 0,
round(./!!(denominator_var) * 100, 2),
0)))
Also note that you don't need to use vars for your vector numerator_vars.
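In current dplyr releases, funs() is deprecated and mutate_at() is superseded by across(). A minimal sketch of the same calculation, assuming dplyr 1.0.0 or later: the .data pronoun looks up the denominator by its string name, and the .names argument builds the *_share column names. The name res is only illustrative.

```r
library(dplyr)

x <- data.frame(a = c(10, 20, 30, 0), b = c(1, 2, 3, 0),
                c = c(1, 2, 3, 0), d = c(8, 16, 24, 0))
denominator_var <- "a"
numerator_vars <- c("b", "c", "d")

res <- x %>%
  mutate(across(
    all_of(numerator_vars),                          # columns chosen by name
    ~ ifelse(.data[[denominator_var]] > 0,           # guard against division by zero
             round(.x / .data[[denominator_var]] * 100, 2),
             0),
    .names = "{.col}_share"                          # b -> b_share, and so on
  ))
res
```

Both the numerators and the denominator stay ordinary character values here, so they can be passed in as function arguments without quoting.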


How to write a for loop to create multiple new variables in R?

Suppose I have this example dataset df with the following variables.
dx_order1<-c(1, 1, NA, 1, 1)
dx_order2<-c(2, 2, 2, 2, NA)
Suppose that these variables are numeric.
I want to recode the variables. For the dx_order1 variable, I want to recode 1 as 1 and 0 otherwise. Similarly, for the dx_order2 variable I want to recode 2 as 1 and 0 otherwise. Say that the new variables are called diag_order1 and diag_order2.
I know how to do this one by one in a manual fashion. The code below does the job:
df$diag_order1 <- ifelse(is.na(df$dx_order1), 0, 1)
df$diag_order2 <- ifelse(is.na(df$dx_order2), 0, 1)
I was wondering how I can achieve the same outcome with a for loop. If I have a lot of similar variables, this type of manual coding is not practical. So any advice on how to use a loop to speed up the process would be appreciated.
You don't need a loop in this instance. You can do this by using is.na to set the non-missing values to 1 and the NA values to 0. For example:
Data
df <- data.frame(dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
df[!is.na(df)] <- 1
df[is.na(df)] <- 0
Or if you have more columns with NA but only want to apply to certain columns then you could do it by specifying those columns:
df2 <- data.frame(letter_col = c(NA, letters[1:4]),
dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
# any columns starting with dx
cols <- names(df2)[grepl("^dx", names(df2))]
df2[, cols][!is.na(df2[, cols])] <- 1
df2[, cols][is.na(df2[, cols])] <- 0
You can use across with mutate in dplyr like this (str_extract comes from stringr, so load that too):
library(dplyr)
library(stringr)
df2 <- data.frame(letter_col = c(NA, letters[1:4]),
dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
> df2
letter_col dx_order1 dx_order2
1 <NA> 1 2
2 a 1 2
3 b NA 2
4 c 1 2
5 d 1 NA
df2 %>% mutate(across(starts_with("dx"), ~case_when(. == as.numeric(str_extract(cur_column(), "\\d$")) ~ 1,
is.na(.) ~ 0,
TRUE ~ 0), .names = "diag_{.col}"))
letter_col dx_order1 dx_order2 diag_dx_order1 diag_dx_order2
1 <NA> 1 2 1 1
2 a 1 2 1 1
3 b NA 2 0 1
4 c 1 2 1 1
5 d 1 NA 1 0
This assumes that each dx column can contain the number in its own name suffix, NA, or other values, as written in your question. Everything other than the suffix value is recoded to 0.
You can coerce the logical vector from is.na to integer. is.na works with the dataframe.
df <- data.frame(dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
df[] <- +!is.na(df)
df
# dx_order1 dx_order2
#1 1 1
#2 1 1
#3 0 1
#4 1 1
#5 1 0
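Since the question asked for a loop that adds new diag_* columns instead of overwriting the originals, here is a minimal base R sketch. It assumes, as the answers above do, that any non-missing value counts as 1. The sub() renaming pattern and the variable names dx_cols and new_col are illustrative choices.

```r
df <- data.frame(dx_order1 = c(1, 1, NA, 1, 1),
                 dx_order2 = c(2, 2, 2, 2, NA))

# Collect the source columns once, before the loop adds new ones
dx_cols <- grep("^dx_order", names(df), value = TRUE)

for (col in dx_cols) {
  new_col <- sub("^dx_", "diag_", col)          # dx_order1 -> diag_order1
  df[[new_col]] <- as.integer(!is.na(df[[col]]))  # 1 if present, 0 if NA
}
df
```

The original dx_order columns are left untouched, so the recoding can be checked against them.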

Perform a function on a dataframe across variable number of columns after removing zeros

I'm trying to create a function where I can pass a function as a variable to perform on a variable number of columns, after removing zeros. I'm not too comfortable with ellipses yet, and I'm guessing this is where the problem is arising. The function is using all the values in the specified rows, summarizing them based on the selected function, and then mutating that one value. I'd like to maintain the function across the row (e.g. rowMeans)
Example:
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
df <- data.frame(a, b, c, d)
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
# get all the values - I'm sure this is the problem, need to somehow turn it back into a df?
vals <- unlist(list(...))
# Remove 0's and perform function
valsNo0 <- vals[vals != 0]
compiledVals <- FUN(valsNo0)
return(compiledVals)
}
df %>%
mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
a b c d foo
1 1 0 5 2 1.457738
2 2 4 4 0 1.457738
3 3 3 3 1 1.457738
4 4 0 2 0 1.457738
5 5 1 1 4 1.457738
df %>%
mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
a b c d foo
1 1 0 5 2 1
2 2 4 4 0 1
3 3 3 3 1 1
4 4 0 2 0 1
5 5 1 1 4 1
# Try row-function (same error occurs with rowMeans)
df %>%
mutate(foo = FUNexcludeZero(function_name = 'pmin', a, b))
Error in mutate_impl(.data, dots) :
Column `foo` must be length 5 (the number of rows) or one, not 8
For function_name = 'sd' the column should be c(NA, 1.41, 0, NA, 2.828) and the min and pmin should be c(1, 2, 3, 4, 1). I'm 100% sure the error has something to do with the list/unlist, but any other way I try it I end up with an error.
I am not sure if this is exactly what you want. You need to perform a row-wise operation on the two vectors, so I used the apply function. This should work for any number of equal-length vectors.
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
#df <- data.frame(a, b, c, d) #not used
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
#combine the vectors into a matrix
df<-cbind(...)
#remove 0 from rows and apply function to the rows
compiledVals <- apply(df, 1, function(x) { x<-x[x!=0]
FUN(x)})
return(compiledVals)
}
FUNexcludeZero(function_name = 'sd', a, b)
#[1] NA 1.414214 0.000000 NA 2.828427
FUNexcludeZero(function_name = 'min', a, b)
#[1] 1 2 3 4 1
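To see where the "not 8" in the question's error message comes from, it helps to compare what unlist(list(...)) and cbind(...) produce for the same inputs. A small sketch using the question's a and b; the names vals and vals_no_zero are illustrative.

```r
a <- 1:5
b <- c(0, 4, 3, 0, 1)

# unlist(list(...)) concatenates the inputs end to end into one vector
vals <- unlist(list(a, b))
length(vals)            # 10

# Dropping zeros from that single vector leaves 8 values in total,
# so a vectorised function such as pmin returns 8 values, not 5
vals_no_zero <- vals[vals != 0]
length(vals_no_zero)    # 8

# cbind() keeps one row per observation, so apply(..., 1, ...) can
# return exactly one value per row
dim(cbind(a, b))        # 5 2
```

Because the cbind() version returns one value per row, its result has the length that mutate() expects.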

Going from a list of elements to chemical formula

I have a list of elemental compositions, each element in its own column. Sometimes these elements have a zero.
C H N O S
1 5 5 0 0 0
2 6 4 1 0 1
3 4 6 2 1 0
I need to combine them so that they read, e.g. C5H5, C6H4NS, C4H6N2O.
This means that for any element of value "1" I should only take the column name, and for anything with value 0, the column should be skipped altogether.
I'm not really sure where to start here. I could add a new column to make it easier to read across the columns, e.g.
c C h H n N o O s S
1 C 5 H 5 N 0 O 0 S 0
2 C 6 H 4 N 1 O 0 S 1
3 C 4 H 6 N 2 O 1 S 0
This way, I just need the output to be a single string, but I need to ignore any zero values, and drop the one after the element name.
Here is a base R solution:
df = read.table(text = "
C H N O S
5 5 0 0 0
6 4 1 0 1
4 6 2 1 0
", header=T)
apply(df, 1, function(x){return(gsub('1', '', paste0(colnames(df)[x > 0], x[x > 0], collapse='')))})
[1] "C5H5" "C6H4NS" "C4H6N2O"
paste0(colnames(df)[x > 0], x[x > 0], collapse='') pastes together the column names where the row values are bigger than zero. gsub then removes the ones. And apply does this for each row in the data frame.
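One caveat: gsub('1', '', ...) removes every digit 1, so a count such as 10 or 12 would lose a digit. A minimal sketch of a variant that blanks out only counts that are exactly 1; the helper name formula_from_row and the two-row test data are hypothetical, chosen to include two-digit counts.

```r
# Hypothetical data with two-digit counts
df <- data.frame(C = c(5, 10), H = c(5, 12), N = c(0, 1))

formula_from_row <- function(x, elements) {
  keep <- x > 0                                   # skip elements with count 0
  counts <- ifelse(x[keep] == 1, "", as.character(x[keep]))  # drop only an exact 1
  paste0(elements[keep], counts, collapse = "")
}

formulas <- apply(df, 1, formula_from_row, elements = colnames(df))
formulas
```

For the second row this keeps "C10H12N", whereas stripping every "1" from the pasted string would give "C0H2N".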
Here's a tidyverse solution that uses some reshaping:
df = read.table(text = "
C H N O S
5 5 0 0 0
6 4 1 0 1
4 6 2 1 0
", header=T)
library(tidyverse)
df %>%
mutate(id = row_number()) %>% # add row id
gather(key, value, -id) %>% # reshape data
filter(value != 0) %>% # remove any zero rows
mutate(value = ifelse(value == 1, "", value)) %>% # replace 1 with ""
group_by(id) %>% # for each row
summarise(v = paste0(key, value, collapse = "")) # create the string value
# # A tibble: 3 x 2
# id v
# <int> <chr>
# 1 1 C5H5
# 2 2 C6H4NS
# 3 3 C4H6N2O
Assume that the input matrix m is as given reproducibly in the Note at the end -- convert it to a matrix if it is a data frame using as.matrix.
Now create a matrix the same shape as m with just the letters so now lets contains the letters and m contains the numbers. Then paste the letters and numbers together and replace those cells for which the number is zero with the empty string. Also replace any cells for which the number is 1 with just the letter. Finally paste each row together. No packages are used and no loops or *apply are used.
lets <- t(replace(t(m), TRUE, colnames(m)))
mm <- paste0(lets, m)
mm <- replace(mm, m == 0, "")
mm <- ifelse(m == 1, lets, mm)
do.call("paste0", as.data.frame(mm))
## [1] "C5H5" "C6H4NS" "C4H6N2O"
Note
the input matrix m in reproducible form is assumed to be:
m <- matrix(c(5, 6, 4, 5, 4, 6, 0, 1, 2, 0, 0, 1, 0, 1, 0), 3, 5,
dimnames = list(NULL, c("C", "H", "N", "O", "S")))
Another idea that avoids the apply with margin 1,
gsub('1', '', sapply(split(df, 1:nrow(df)), function(i)
paste(paste0(names(i)[i != 0], i[i != 0]), collapse = '')))
# 1 2 3
# "C5H5" "C6H4NS" "C4H6N2O"
Another option
library(dplyr)
#Get indices of all non-zero numbers in the dataframe
inds <- which(df!=0, arr.ind = TRUE)
#Create a dataframe with row index, column index and value at that position
vals <- data.frame(inds, val = df[inds])
#For each row paste the name of the column and value together and then replace 1
vals %>%
group_by(row) %>%
summarise(chemical = paste0(names(df)[col], val,collapse = "")) %>%
mutate(chemical = gsub("[1]", "", chemical))
# row chemical
# <int> <chr>
#1 1 C5H5
#2 2 C6H4NS
#3 3 C4H6N2O

Create a vector of counts

I wanted to create a vector of counts if possible.
For example: I have a vector
x <- c(3, 0, 2, 0, 0)
How can I create a frequency vector for all integers between 0 and 3? Ideally I wanted to get a vector like this:
> 3 0 1 1
which gives me the counts of 0, 1, 2, and 3 respectively.
Much appreciated!
You can do
table(factor(x, levels=0:3))
Simply using table(x) is not enough, because it only reports values that actually occur in x and so omits the zero count for 1.
Or with tabulate which is faster
tabulate(factor(x, levels = min(x):max(x)))
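As a sketch of how the two approaches relate, assuming x holds whole numbers in the range 0 to 3: tabulate() counts positive integers, so shifting x by one and fixing nbins gives the same counts as the factor-based table. The names counts_table and counts_tabulate are illustrative.

```r
x <- c(3, 0, 2, 0, 0)

# table() on a factor with explicit levels keeps the zero count for 1
counts_table <- as.vector(table(factor(x, levels = 0:3)))

# tabulate() counts the integers 1..nbins, so shift 0..3 to 1..4
counts_tabulate <- tabulate(x + 1, nbins = 4)

counts_table      # 3 0 1 1
counts_tabulate   # 3 0 1 1
```

Fixing the range explicitly (levels = 0:3 or nbins = 4) keeps the result at length 4 even when the largest value is absent from x.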
You can do this using rle (I made this in minutes, so sorry if it's not optimized enough).
x = c(3, 0, 2, 0, 0)
r = rle(x)
f = function(x) sum(r$lengths[r$values == x])
s = sapply(FUN = f, X = as.list(0:3))
data.frame(x = 0:3, freq = s)
#> data.frame(x = 0:3, freq = s)
# x freq
#1 0 3
#2 1 0
#3 2 1
#4 3 1
You can just use table():
a <- table(x)
a
x
#0 2 3
#3 1 1
Then you can subset it:
a[names(a)==0]
#0
#3
Or convert it into a data.frame if you're more comfortable working with that:
u<-as.data.frame(table(x))
u
# x Freq
#1 0 3
#2 2 1
#3 3 1
Edit 1:
For levels:
a<- as.data.frame(table(factor(x, levels=0:3)))

Imputing labels based on a comparison of columns

I don't think this question has been asked on this board before. I have two columns of 1s and 0s in a dataframe. Let's call these columns X and Y, respectively. In a comparison of X and Y for any row, one of four combinations is obviously possible:
A: 1, 0
B: 0, 1
C: 1, 1
D: 0, 0
Imagine the dataframe has m columns total, but we're interested only in X and Y. I'd like to write a function that compares only X and Y and then characterizes the particular combination with the corresponding labels A, B, C, or D in a new column (let's call it Z).
So say the data looks like:
X Y
1 1
0 1
0 0
1 1
The function will output:
X Y Z
1 1 C
0 1 B
0 0 D
1 1 C
I imagine this would be trivial but I'm an R newbie. Thanks for any guidance!
We create a lookup dataset with one row per unique X/Y combination and its label (keyDat, defined under "data" below) and then merge it with the input dataset on the 'X' and 'Y' columns
merge(df1, keyDat, by = c("X", "Y"), all.x=TRUE)
# X Y Z
#1 0 0 D
#2 0 1 B
#3 1 1 C
#4 1 1 C
Or to get the output in the same order, use left_join
library(dplyr)
left_join(df1, keyDat)
#Joining by: c("X", "Y")
# X Y Z
#1 1 1 C
#2 0 1 B
#3 0 0 D
#4 1 1 C
data
keyDat <- data.frame(X= c(1, 0, 1, 0), Y = c(0, 1, 1,
0), Z = c("A", "B", "C", "D"), stringsAsFactors=FALSE)
df1 <- data.frame(X= c(1, 0, 0, 1), Y=c(1, 1, 0, 1))
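For a fixed set of four combinations, a named lookup vector is a lightweight alternative to a join. A minimal base R sketch, assuming X and Y contain only 0 and 1; the name label_lookup is illustrative.

```r
df1 <- data.frame(X = c(1, 0, 0, 1), Y = c(1, 1, 0, 1))

# Keys are "X Y" pairs pasted together; values are the labels
label_lookup <- c("1 0" = "A", "0 1" = "B", "1 1" = "C", "0 0" = "D")

# paste() builds the key for each row; unname() drops the key names
df1$Z <- unname(label_lookup[paste(df1$X, df1$Y)])
df1
```

This keeps the original row order, like left_join, and any combination missing from the lookup would come back as NA.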
