add new column to dataframe by referencing name of existing columns - r

I have a dataframe of this form:
df <- data.frame(abc = c(1, 0, 3, 2, 0),
foo = c(0, 4, 2, 1, 0),
glorx = c(0, 0, 0, 1, 2))
Here, the column names are strings and the values in the data frame are the number of times I would like to concatenate that string in a new data column. The new column I'd like to create would be a concatenation across all existing columns, with each column name being repeated according to the data.
For example, I'd like to create this new column and add it to the dataframe.
new_col <- c('abc', 'foofoofoofoo', 'abcabcabcfoofoo', 'abcabcfooglorx', 'glorxglorx')
also_acceptable <- c('abc', 'foofoofoofoo', 'abcfooabcfooabc', 'abcfooglorxabc', 'glorxglorx')
df %>% mutate(new_col = new_col, also_acceptable = also_acceptable)
The order of concatenation does not matter. The core problem I have is I don't know how to reference the name of a column by row when constructing a purrr::map() or dplyr::mutate() function to build a new column. Thus, I'm not sure how to programmatically construct this new column.
(The core application here is combinatorial construction of chemical formulae in case anyone wonders why I would need such a thing.)

Here is an option using Map and strrep:
mutate(df, new_col = do.call(paste, c(sep="", Map(strrep, names(df), df))))
# abc foo glorx new_col
#1 1 0 0 abc
#2 0 4 0 foofoofoofoo
#3 3 2 0 abcabcabcfoofoo
#4 2 1 1 abcabcfooglorx
#5 0 0 2 glorxglorx
Or a simpler version, as suggested in @thelatemail's comment:
df %>% mutate(new_col = do.call(paste0, Map(strrep, names(.), .)))
Map gives a list as follows:
Map(strrep, names(df), df) %>% as_tibble()
# A tibble: 5 x 3
# abc foo glorx
# <chr> <chr> <chr>
#1 abc
#2 foofoofoofoo
#3 abcabcabc foofoo
#4 abcabc foo glorx
#5 glorxglorx
Use do.call(paste, ...) to paste strings rowwise.
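To make the two steps explicit, here is a minimal base R sketch using the question's data. The intermediate name pieces is only illustrative and not part of the original answer; the result should match new_col from the question.

```r
# Example data from the question
df <- data.frame(abc   = c(1, 0, 3, 2, 0),
                 foo   = c(0, 4, 2, 1, 0),
                 glorx = c(0, 0, 0, 1, 2))

# Step 1: repeat each column name by the counts stored in that column.
# Map() pairs names(df)[i] with df[[i]], giving one character vector per column.
pieces <- Map(strrep, names(df), df)

# Step 2: paste the per-column vectors together element by element (row by row).
# unname() keeps the list names from being passed to paste0() as argument names.
df$new_col <- do.call(paste0, unname(pieces))

df$new_col
```

Because strrep() returns an empty string for a count of 0, columns that do not contribute to a row simply drop out of the pasted result.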

Related

Mutate multiple dataframe columns where cell content depends on column name

I'm trying to replace binary information in dataframe columns with strings that refer to the columns' names.
My data looks like this (just with more natXY columns and some additional variables):
df <- data.frame(id = c(1:5), natAB = c(1,0,0,0,1), natCD = c(0,1,0,0,0), natother = c(0,0,1,1,0), var1 = runif(5, 1, 10))
df
All column names in question start with "nat", mostly followed by two letters although some contain a different number of characters.
For a single column, the following code achieves the desired outcome:
df %>% mutate(natAB = ifelse(natAB == 1, "AB", NA)) -> df
Now I need to generalise this line in order to apply it to the other columns using the mutate() and across() functions.
I imagine something like this
df %>% mutate(across(natAB:natother, ~ ifelse(
. == 1, paste(substr(colnames(.), start = 4, stop = nchar(colnames(.)))), NA))) -> df
... but end up with all my "nat" columns filled with NA. How do I reference the column name correctly in this code structure?
Any help is much appreciated.
You can use cur_column to refer to the column name in an across call, and then use str_remove:
library(stringr)
library(dplyr)
df %>%
mutate(across(natAB:natother,
~ ifelse(.x == 1, str_remove(cur_column(), "nat"), NA)))
# id natAB natCD natother var1
# 1 1 AB <NA> <NA> 7.646891
# 2 2 <NA> CD <NA> 4.704543
# 3 3 <NA> <NA> other 7.717925
# 4 4 <NA> <NA> other 3.367320
# 5 5 AB <NA> <NA> 8.455011
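If dplyr is not available, roughly the same idea can be sketched in base R by iterating over the columns and their names together. This is an alternative to cur_column(), not what the answer above does; nat_cols is an illustrative name, and var1 is left out of the example data for brevity.

```r
# Example data modelled on the question (var1 omitted)
df <- data.frame(id = 1:5,
                 natAB    = c(1, 0, 0, 0, 1),
                 natCD    = c(0, 1, 0, 0, 0),
                 natother = c(0, 0, 1, 1, 0))

# Columns whose names start with "nat"
nat_cols <- grep("^nat", names(df), value = TRUE)

# For each column, replace 1 with the column name minus the "nat" prefix,
# and everything else with NA. Map() supplies the column and its name together.
df[nat_cols] <- Map(
  function(col, nm) ifelse(col == 1, sub("^nat", "", nm), NA_character_),
  df[nat_cols], nat_cols
)

df
```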

How to select variables with numeric suffixes lower than a value

I have a data frame similar to this one.
df <- data.frame(id=c(1,2,3), tot_1=runif(3, 0, 100), tot_2=runif(3, 0, 100), tot_3=runif(3, 0, 100), tot_4=runif(3, 0, 100))
I want to select or make an operation only with those with suffixes lower than 3.
#select
df <- df %>% select(id, tot_1, tot_2)
#or sum
df <- df %>% mutate(sumVar = rowSums(across(c(tot_1, tot_2))))
However, in my real data, there are many more variables and not in order. So how could I select them without doing it manually?
We may use matches
df %>%
mutate(sumVar = rowSums(across(matches('tot_[1-2]$'))))
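To see which names this pattern picks up, the regular expression can be checked against a vector of names with base R's grepl(). This only approximates matches(), which ignores case by default. The column tot_12 is a made-up addition to show that the trailing $ keeps two-digit suffixes out.

```r
# Column names from the question, plus a hypothetical two-digit suffix
cols <- c("id", "tot_1", "tot_2", "tot_3", "tot_4", "tot_12")

# "tot_" followed by exactly one character in the range 1-2, then end of string
matched <- cols[grepl("tot_[1-2]$", cols)]

matched
```

The pattern is not anchored at the start, so a name such as x_tot_1 would also match; prefixing the pattern with ^ would prevent that.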
If we need to be more flexible, extract the numeric part of the column names that contain 'tot', subset the names based on the condition, and use the resulting names:
library(stringr)
nm1 <- str_subset(names(df), 'tot')
nm2 <- nm1[readr::parse_number(nm1) <3]
df %>%
mutate(sumVar = rowSums(across(all_of(nm2))))
Solution with num_range
This is a rare use case for the often-forgotten num_range selection helper from dplyr, which matches column names made of a prefix followed by a number from a given range:
Determine the threshold
suffix_threshold <- 3
select()
library(dplyr)
df %>% select(id, num_range(prefix='tot_',
range=seq_len(suffix_threshold-1)))
id tot_1 tot_2
1 1 26.75082 26.89506
2 2 21.86453 18.11683
3 3 51.67968 51.85761
mutate() with rowSums()
library(dplyr)
df %>% mutate(sumVar = across(num_range(prefix='tot_', range=seq_len(suffix_threshold-1)))%>%
rowSums)
id tot_1 tot_2 tot_3 tot_4 sumVar
1 1 26.75082 26.89506 56.27829 71.79353 53.64588
2 2 21.86453 18.11683 12.91569 96.14099 39.98136
3 3 51.67968 51.85761 25.63676 10.01408 103.53730
Here is a base R way -
cols <- grep('tot_', names(df), value = TRUE)
#Select
df[c('id', cols[as.numeric(sub('tot_', '',cols)) < 3])]
# id tot_1 tot_2
#1 1 75.409112 30.59338
#2 2 9.613496 44.96151
#3 3 58.589574 64.90672
#Rowsums
df$sumVar <- rowSums(df[cols[as.numeric(sub('tot_', '',cols)) < 3]])
df
# id tot_1 tot_2 tot_3 tot_4 sumVar
#1 1 75.409112 30.59338 59.82815 50.495758 106.00250
#2 2 9.613496 44.96151 84.19916 2.189482 54.57501
#3 3 58.589574 64.90672 18.17310 71.390459 123.49629

Split variable on every other row to form two new columns in data.frame

After scraping a pdf, I have a data frame with a chr text var:
df = data.frame(text = c("abc","def","abc","def"))
My question is how to turn it into:
df = data.frame(text1 = c("abc","abc"),text2=c("def","def"))
I am able to index the rows and manually rebuild a new df, but was curious if it could be done within the dplyr pipe.
All solutions I have been able to find involve splitting each row, but not to split whole rows of a variable into new columns.
Using dplyr, you could create a grouping column (ind) whose values alternate between 1 and 2 from row to row, then group_by ind and create a sequence column (id) so that spread can put the data into two columns.
library(dplyr)
library(tidyr)
df %>%
mutate(ind = rep(c(1, 2),length.out = n())) %>%
group_by(ind) %>%
mutate(id = row_number()) %>%
spread(ind, text) %>%
select(-id)
# `1` `2`
# <fct> <fct>
#1 abc def
#2 abc def
A base R option would be to split df into two data frames holding alternate rows, using a grouping sequence created with rep, and cbind them together to form a 2-column data frame.
do.call("cbind", split(df, rep(c(1, 2), length.out = nrow(df))))
# text text
#1 abc def
#3 abc def
We could do this in base R. Use the matrix route to rearrange a vector/column into a matrix and then convert it to data.frame (as.data.frame). As the number of columns is constant i.e. 2, specify that value in ncol
as.data.frame(matrix(df$text, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c('text1', 'text2'))))
# text1 text2
#1 abc def
#2 abc def
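The matrix approach generalises to any fixed number of columns. Below is a minimal sketch, assuming the number of rows is an exact multiple of the number of target columns; n_cols and wide are illustrative names, not part of the original answer.

```r
# Example data from the question
df <- data.frame(text = c("abc", "def", "abc", "def"))

# Number of columns to spread the single column over
n_cols <- 2

# Guard: matrix() would recycle values if the lengths did not divide evenly
stopifnot(nrow(df) %% n_cols == 0)

# byrow = TRUE fills the matrix row by row, so consecutive input rows
# end up side by side in the output
wide <- as.data.frame(
  matrix(as.character(df$text), ncol = n_cols, byrow = TRUE,
         dimnames = list(NULL, paste0("text", seq_len(n_cols))))
)

wide
```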
Or another option is unstack from base R after creating a sequence of alternate ids (making use of the recycling)
unstack(transform(df, val = paste0('text', 1:2)), text ~ val)
# text1 text2
#1 abc def
#2 abc def
Or we can split into a list of vectors and then cbind it together
as.data.frame(do.call(cbind, split(as.character(df$text), 1:2)))
# 1 2
#1 abc def
#2 abc def
Or another option is dcast from data.table
library(data.table)
dcast(setDT(df), rowid(text)~ text)[, text := NULL][]
data
df <- data.frame(text = c("abc","def","abc","def"))

Return column names based on condition

I have a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
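The question's edit mentions that some columns contain NAs, in which case max(x) returns NA and the comparison no longer works. A possible adjustment, sketched here under the assumption that NAs should simply be ignored, is to pass na.rm = TRUE and wrap the comparison in which(). The NA positions in the example data are invented for illustration.

```r
# Example data modelled on the question, with NAs added for illustration
Df <- data.frame(a = c(4, 3, NA),
                 b = c(4, 5, 6),
                 c = c(3, NA, 5))

# For each row, ignore NAs when finding the maximum; which() drops the
# NA results of the comparison so that only true matches are kept
Df$maxcol <- apply(Df, 1, function(x) {
  paste0(names(x)[which(x == max(x, na.rm = TRUE))], collapse = "")
})

Df
```

A row consisting entirely of NAs would trigger a warning from max() and end up with an empty string, so such rows may need separate handling.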
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
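We first convert the data frame from wide to long using gather; then, grouped by row, we paste together the column names that hold the maximum value, and finally spread the long data frame back to wide. (In the code above, the values column holds the column names and the key column holds the numbers.)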
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)

using replace_na() with indeterminate number of columns

My data frame looks like this:
df <- tibble(x = c(1, 2, NA),
y = c(1, NA, 3),
z = c(NA, 2, 3))
I want to replace NA with 0 using tidyr::replace_na(). As this function's documentation makes clear, it's straightforward to do this once you know which columns you want to perform the operation on.
df <- df %>% replace_na(list(x = 0, y = 0, z = 0))
But what if you have an indeterminate number of columns? (I say 'indeterminate' because I'm trying to create a function that does this on the fly using dplyr tools.) If I'm not mistaken, the base R equivalent to what I'm trying to achieve using the aforementioned tools is:
df[, 1:ncol(df)][is.na(df[, 1:ncol(df)])] <- 0
But I always struggle to get my head around this code. Thanks in advance for your help.
We can do this by creating a list of 0's based on the number of columns of dataset and set the names with the column names
library(tidyverse)
df %>%
replace_na(set_names(as.list(rep(0, length(.))), names(.)))
# A tibble: 3 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 2 0 2
#3 0 3 3
Another option is mutate_all (or mutate_at for selected columns, or mutate_if to select columns by a condition), applying replace_na to each column:
df %>%
mutate_all(replace_na, replace = 0)
With base R, it is more straightforward
df[is.na(df)] <- 0
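Since the question mentions that the base R indexing is hard to follow, here is the same operation split into two steps. A plain data.frame is used instead of a tibble to keep the sketch free of package dependencies, and mask is an illustrative name.

```r
# Example data from the question, as a plain data.frame
df <- data.frame(x = c(1, 2, NA),
                 y = c(1, NA, 3),
                 z = c(NA, 2, 3))

# is.na() on a data frame returns a logical matrix of the same shape,
# TRUE wherever a cell is missing
mask <- is.na(df)

# Indexing the data frame with that matrix selects exactly the missing cells,
# regardless of how many columns there are
df[mask] <- 0

df
```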
