I'm trying to replace binary information in dataframe columns with strings that refer to the columns' names.
My data looks like this (just with more natXY columns and some additional variables):
df <- data.frame(id = c(1:5), natAB = c(1,0,0,0,1), natCD = c(0,1,0,0,0), natother = c(0,0,1,1,0), var1 = runif(5, 1, 10))
df
All column names in question start with "nat", mostly followed by two letters although some contain a different number of characters.
For a single column, the following code achieves the desired outcome:
df %>% mutate(natAB = ifelse(natAB == 1, "AB", NA)) -> df
Now I need to generalise this line in order to apply it to the other columns using the mutate() and across() functions.
I imagine something like this
df %>% mutate(across(natAB:natother, ~ ifelse(
. == 1, paste(substr(colnames(.), start = 4, stop = nchar(colnames(.)))), NA))) -> df
... but end up with all my "nat" columns filled with NA. How do I reference the column name correctly in this code structure?
Any help is much appreciated.
You can use cur_column to refer to the column name in an across call, and then use str_remove:
library(stringr)
library(dplyr)
df %>%
mutate(across(natAB:natother,
~ ifelse(.x == 1, str_remove(cur_column(), "nat"), NA)))
# id natAB natCD natother var1
# 1 1 AB <NA> <NA> 7.646891
# 2 2 <NA> CD <NA> 4.704543
# 3 3 <NA> <NA> other 7.717925
# 4 4 <NA> <NA> other 3.367320
# 5 5 AB <NA> <NA> 8.455011
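If you have many more natXY columns than you want to type out, the same idea can be sketched in base R by selecting the columns by prefix. This is only an illustration: the fixed var1 values are placeholders for the random ones in the question, and the "^nat" pattern assumes no unrelated column starts with "nat".

```r
# Toy data mirroring the question; var1 uses arbitrary fixed values
df <- data.frame(id = 1:5,
                 natAB = c(1, 0, 0, 0, 1),
                 natCD = c(0, 1, 0, 0, 0),
                 natother = c(0, 0, 1, 1, 0),
                 var1 = c(2.5, 3.1, 4.7, 8.2, 9.9))

# Pick every column whose name starts with "nat"
nat_cols <- grep("^nat", names(df), value = TRUE)

# For each of them, turn 1 into the name without the "nat" prefix, anything else into NA
df[nat_cols] <- lapply(nat_cols, function(col) {
  ifelse(df[[col]] == 1, sub("^nat", "", col), NA_character_)
})
```

In the dplyr version the equivalent selection would be starts_with("nat") inside across().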
Related
Goals: To merge multiple columns just based on the similarity of the column name.
Issues: I am dealing with a large data set where the column names are replicated and look like this: wk1.1, wk1.2, wk1.3. For each row, there will be only one value in the similar column names, and the others will be NA. Coalesce is very helpful, but becomes tedious (messes up automation) when I have to list each column name. Is there a way to coalesce based off a string of characters? For instance below, I would prefer to coalesce %in% "wk1."
library(dplyr)
wk1.1 <- c(15, 4, 1)
wk1.2 <- c(3, 20, 4)
wk1.3 <- c(1, 2, 17)
df <- data.frame(wk1.1, wk1.2, wk1.3)
df[df < 14] <- NA
df1 <- df %>%
mutate(wk1 = coalesce(df$wk1.1, df$wk1.2, df$wk1.3))
We can splice the columns into coalesce() with !!!
library(dplyr)
df %>%
mutate(wk1 = coalesce(!!! .))
# wk1.1 wk1.2 wk1.3 wk1
#1 15 NA NA 15
#2 NA 20 NA 20
#3 NA NA 17 17
Or another option is to reduce and apply coalesce
library(purrr)
df %>%
mutate(wk1 = reduce(., coalesce))
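Both options above coalesce every column of df. If the data frame also holds other columns, one possible approach is to pick the columns by name pattern first. Below is a rough base R sketch of that; the id column and the helper first_non_na() are made up for the illustration, and the helper only mimics what coalesce() does for two vectors.

```r
# Example data with an extra column that must not take part in the coalesce
df <- data.frame(id = c("x", "y", "z"),
                 wk1.1 = c(15, NA, NA),
                 wk1.2 = c(NA, 20, NA),
                 wk1.3 = c(NA, NA, 17))

# Columns whose names start with "wk1."
wk1_cols <- grep("^wk1\\.", names(df), value = TRUE)

# Hypothetical helper: keep a where it is not NA, otherwise fall back to b
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Fold the helper over the selected columns, left to right
df$wk1 <- Reduce(first_non_na, df[wk1_cols])
```

With dplyr loaded, the same selection could be spliced in as coalesce(!!!select(., starts_with("wk1."))).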
I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.
Here's what the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns so, for example, all the values in the first row between the first column and the column 100, all the values in the first row between the column 1 and the column 200, all the values in the second row between the first column and the column 100, etc.
Using the sample data, I've come up with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the columns a1 to a100 instead of using this approach: .[,paste("a", 1:3, sep="")]? I am interested in selecting the columns by name. I do not want to select them by position, because a100 is not necessarily the 100th column.
Also, I am converting the NAs and the NaNs to 0 in order to be able to sum the rows. I am doing it with mutate_all(funs(replace_na(., 0))), which loses my first row that contains the names of the values. What would be the best way to replace NA and NaN without mutating the string values of the first row to 0?
The type of the columns I am adding is integer, as I converted them beforehand with mutate_if(~!is.numeric(.x), as.numeric). Should I follow the same approach in case I have dbl?
Thank you!
Here is one way to do this: after transforming the data to long format, we create groups of n rows within each name and take the sum of each group.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name) %>%
group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
#If needed in wider format again
pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
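Note that the grouping above goes by column position. If the buckets should follow the number in the column name instead (so that a missing a7 simply leaves a gap), one option is to derive the bucket from the name. The following base R sketch assumes every column to be summed is named "a" followed by a number, and uses a bucket width of 3 purely for the toy data (it would be 100 in the real case). The name column is never touched, so nothing has to be replaced with 0.

```r
# Sample data from the question, as a plain data frame
dataset <- data.frame(name = c("A", "B", "C", "D"),
                      a1 = 1:4, a2 = c(1, 2, NaN, 5), a3 = 1:4, a4 = 1:4,
                      a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4)

n <- 3  # assumed bucket width for this toy example

# Columns named "a<number>" and the number each one carries
num_cols <- grep("^a[0-9]+$", names(dataset), value = TRUE)
idx <- as.integer(sub("^a", "", num_cols))

# a1..a3 -> bucket 1, a4..a6 -> bucket 2, a7..a9 -> bucket 3, and so on
bucket <- (idx - 1) %/% n + 1

# Row sums per bucket; na.rm = TRUE skips both NA and NaN
sums <- sapply(split(num_cols, bucket),
               function(cols) rowSums(dataset[cols], na.rm = TRUE))
colnames(sums) <- paste0("sum", colnames(sums))

result <- cbind(dataset["name"], sums)
```

Here sum3 contains only a8, because a7 and a9 do not exist in the data.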
I've a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
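Since the edit to the question mentions NAs, here is a possible variant of the same apply() call that ignores them. It is a sketch on made-up data (the NA in column a is an assumption for illustration). A row consisting only of NAs would still need separate handling, because max() then returns -Inf with a warning.

```r
# Example data with one missing value
Df <- data.frame(a = c(4, 3, NA), b = 4:6, c = 3:5)

Df$maxcol <- apply(Df, 1, function(x) {
  # na.rm = TRUE ignores NAs when finding the maximum;
  # which() drops the NA comparisons when picking the names
  paste0(names(x)[which(x == max(x, na.rm = TRUE))], collapse = "")
})
```

The right-hand side is evaluated before maxcol is added, so only a, b and c take part in the comparison.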
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
We first convert the dataframe from wide to long using gather. Then, grouping by row, we paste together the column names for the max key, and finally spread the long dataframe back to wide.
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)
I have a dataframe of this form:
df <- data.frame(abc = c(1, 0, 3, 2, 0),
foo = c(0, 4, 2, 1, 0),
glorx = c(0, 0, 0, 1, 2))
Here, the column names are strings and the values in the data frame are the number of times I would like to concatenate that string in a new data column. The new column I'd like to create would be a concatenation across all existing columns, with each column name being repeated according to the data.
For example, I'd like to create this new column and add it to the dataframe.
new_col <- c('abc', 'foofoofoofoo', 'abcabcabcfoofoo', 'abcabcfooglorx', 'glorxglorx')
also_acceptable <- c('abc', 'foofoofoofoo', 'abcfooabcfooabc', 'abcfooglorxabc', 'glorxglorx')
df %>% mutate(new_col = new_col, also_acceptable = also_acceptable)
The order of concatenation does not matter. The core problem I have is I don't know how to reference the name of a column by row when constructing a purrr::map() or dplyr::mutate() function to build a new column. Thus, I'm not sure how to programmatically construct this new column.
(The core application here is combinatorial construction of chemical formulae in case anyone wonders why I would need such a thing.)
Here is an option using Map and strrep:
mutate(df, new_col = do.call(paste, c(sep="", Map(strrep, names(df), df))))
# abc foo glorx new_col
#1 1 0 0 abc
#2 0 4 0 foofoofoofoo
#3 3 2 0 abcabcabcfoofoo
#4 2 1 1 abcabcfooglorx
#5 0 0 2 glorxglorx
Or a simpler version, as per @thelatemail's comment:
df %>% mutate(new_col = do.call(paste0, Map(strrep, names(.), .)))
Map gives a list as follows:
Map(strrep, names(df), df) %>% as_tibble()
# A tibble: 5 x 3
#  abc       foo          glorx
#  <chr>     <chr>        <chr>
#1 abc
#2           foofoofoofoo
#3 abcabcabc foofoo
#4 abcabc    foo          glorx
#5                        glorxglorx
Use do.call(paste, ...) to paste strings rowwise.
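To make the two steps easier to follow, here they are separated out in base R, with a check against the new_col given in the question. The unname() call is my own addition and not part of the answer above: it is a precaution so that a column that happens to be called collapse would not be taken as an argument of paste0().

```r
df <- data.frame(abc = c(1, 0, 3, 2, 0),
                 foo = c(0, 4, 2, 1, 0),
                 glorx = c(0, 0, 0, 1, 2))

# Step 1: repeat each column name by the counts in that column
# (one character vector per column)
pieces <- Map(strrep, names(df), df)

# Step 2: paste the vectors together element by element, i.e. row by row
df$new_col <- do.call(paste0, unname(pieces))

# Expected result taken from the question
expected <- c("abc", "foofoofoofoo", "abcabcabcfoofoo",
              "abcabcfooglorx", "glorxglorx")
identical(df$new_col, expected)
```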
Here is my example
Student <- c('A', 'B', 'B')
Assessor <- c('C', 'D', 'D')
Score <- c(1, 5, 7)
df <- data.frame(Student, Assessor, Score)
df <- dcast(df, Student ~ Assessor,fun.aggregate=(function (x) x), value = 'Score')
print(df)
The output:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
While I want to get something like
C D
A 1 NaN
B NaN 5
B NaN 7
What am I missing?
In addition, if I replace Score with
Score <- c('foo', 'bar','bar')
The output will be:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
Any thoughts?
Since dcast spreads over the unique values of the left-hand side of the formula, I think you can achieve your goal with a (not so elegant) hack, but I bet there are other ways to do it, maybe with table.
library(reshape2)
dcast(df, Student + Score ~ ...)[-2]
Using Score as value column: use value.var to override.
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
The hack is to keep both Student and Score on the left-hand side, spread the other variables (in this case Assessor), and then remove the Score column with [-2] to get the desired output (unless your first column is actually meant to be row names, which is impossible in base R because row names must be unique; in that case you need a data.table solution).
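If you would rather avoid the reshaping packages for this, the same table can be built with a plain loop over the assessors. This is just a base R sketch of an alternative, not what dcast does internally. It keeps one output row per input row and also works when Score is a character vector.

```r
df <- data.frame(Student = c("A", "B", "B"),
                 Assessor = c("C", "D", "D"),
                 Score = c(1, 5, 7),
                 stringsAsFactors = FALSE)

# Start with the Student column only
out <- data.frame(Student = df$Student, stringsAsFactors = FALSE)

# One new column per assessor: the score where that assessor gave it, NA elsewhere
for (a in unique(df$Assessor)) {
  out[[a]] <- ifelse(df$Assessor == a, df$Score, NA)
}
```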
This uses the dev version of tidyr (0.3.0); get it from GitHub.
First we complete the combinations of Student/Assessor, then we nest it all into a list, spread and then unnest the list into new rows.
library(dplyr)
library(tidyr)
df %>% complete(Student, Assessor) %>%
nest(Score) %>%
spread(Assessor, Score) %>%
unnest(C) %>%
unnest(D)
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7