I have a dataframe:
library(tidyverse)
df <- tribble(~col1, ~col2, 1, 2)
Now I want to create a column. I have the name of the new column in a string. It does work like this:
df %>%
mutate("col3" = 3)
# A tibble: 1 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 1 2 3
But it does not work like this:
newColumnName <- "col3"
df %>%
mutate(newColumnName = 3)
# A tibble: 1 x 3
col1 col2 newColumnName
<dbl> <dbl> <dbl>
1 1 2 3
How do I create a new column that gets its name from a string in an object?
Use !! together with the definition operator := to use the string stored in a variable as the column name.
:= supports unquoting on both the LHS and the RHS
library(dplyr)
newColumnName <- "col3"
df %>% mutate(!!newColumnName := 3)
# A tibble: 1 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 1 2 3
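As a hedged aside, recent versions of dplyr and rlang also accept a glue-style string on the left-hand side of :=, which some people find easier to read than !!. The sketch below assumes a reasonably current dplyr installation; the data frame is rebuilt with tibble() only so that the snippet runs on its own.

```r
library(dplyr)

df <- tibble(col1 = 1, col2 = 2)
newColumnName <- "col3"

# The value of newColumnName is substituted inside the braces and used
# as the name of the new column, just like !!newColumnName := 3
res <- df %>% mutate("{newColumnName}" := 3)
names(res)
```

The braces can also be combined with fixed text, for example "{newColumnName}_scaled", when the new name is derived from an existing one.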
I have the following table:
col1 col2 col3 col4
1    2    1    4
5    6    6    3
My goal is to find the max value per each row, and then find how many times it was repeated in the same row.
The resulting table should look like this:
col1 col2 col3 col4 max_val repetition
1    2    1    4    4       1
5    6    6    3    6       2
Now to achieve this, I am doing the following for Max:
df%>% rowwise%>%
mutate(max=max(col1:col4))
However, I am struggling to find the repetition. My idea is to use this pseudo code in mutate:
sum( "select current row entirely or only for some columns"==max). But I don't know how to select entire row or only some columns of it and use its content to do the check, i.e.: is it equal to the max. How can we do this in dplyr?
A dplyr approach:
library(dplyr)
df %>%
rowwise() %>%
mutate(max_val = max(across(everything())),
repetition = sum(across(col1:col4) == max_val))
# A tibble: 2 × 6
# Rowwise:
col1 col2 col3 col4 max_val repetition
<int> <int> <int> <int> <int> <int>
1 1 2 1 4 4 1
2 5 6 6 3 6 2
A base R approach:
df$max_val <- apply(df, 1, max)
df$repetition <- rowSums(df[, 1:4] == df$max_val)
For other (non-tidyverse) readers, a base R approach could be:
df$max_val <- apply(df, 1, max)
df$repetition <- apply(df, 1, function(x) sum(x[1:4] == x[5]))
Output:
# col1 col2 col3 col4 max_val repetition
# 1 1 2 1 4 4 1
# 2 5 6 6 3 6 2
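If apply() feels heavy for a wide table, the same result can be sketched with pmax(), which is vectorised over rows. This is only an illustration under the assumption that all four columns are numeric; the cols vector is a helper introduced here, not part of the original answers.

```r
df <- data.frame(col1 = c(1, 5), col2 = c(2, 6), col3 = c(1, 6), col4 = c(4, 3))
cols <- c("col1", "col2", "col3", "col4")  # columns to scan (illustrative helper)

# pmax() compares the columns element by element, giving one maximum per row
df$max_val <- do.call(pmax, df[cols])

# The max_val vector is recycled down each column, so row i is compared
# with max_val[i]; rowSums() then counts the matches in each row
df$repetition <- rowSums(df[cols] == df$max_val)
df
```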
Although dplyr has added many tools for working across rows of data, it remains, in my mind at least, much easier to adhere to tidy principles and always convert the data to "long" format for these kinds of operations.
Thus, here is a tidy approach:
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
group_by(row) %>%
mutate(max_val = max(value), repetitions = sum(value == max(value))) %>%
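pivot_wider(id_cols = c(row, max_val, repetitions)) %>%
ungroup() %>%
select(col1:col4, max_val, repetitions)
The ungroup() drops the grouping by row so that the helper column can be removed, and the last select() is just to get the columns in the order you want.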
I'd like to create several new columns. They should take their names from one vector and they should be computed by taking one column in the data and dividing it by another.
mytib <- tibble(id = 1:2, value1 = c(4,6), value2 = c(42, 5), total = c(2,2))
myvalues <- c("value1", "value2")
mynames <- c("value1_percent", "value2_percent")
mytib %>%
mutate({{ mynames }} := {{ myvalues }}/total)
Here, I get the error message, which makes me think that the curly-curly operator is misplaced
Error in local_error_context(dots = dots, .index = i, mask = mask) : promise already under evaluation: recursive default argument reference or earlier problems?
I'd like to calculate the percentage columns programmatically (since I have many such columns in my data).
The desired output should be equivalent to this:
mytib %>%
mutate( "value1_percent" = value1/total, "value2_percent" = value2/total)
which gives
# A tibble: 2 × 6
id value1 value2 total value1_percent value2_percent
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 42 2 2 21
2 2 6 5 2 3 2.5
You could use across and construct the new names in its .names argument:
library(dplyr)
mytib %>%
mutate(across(starts_with('value'),
~ .x / total,
.names = "{.col}_percent"
))
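Since the question already stores the source column names in a character vector, a small variation on the across() call above can use that vector directly through all_of() instead of matching on a prefix. This is a sketch using the sample data from the question.

```r
library(dplyr)

mytib <- tibble(id = 1:2, value1 = c(4, 6), value2 = c(42, 5), total = c(2, 2))
myvalues <- c("value1", "value2")

# all_of() selects exactly the columns named in the character vector;
# .names builds each new column name from the source column name
res <- mytib %>%
  mutate(across(all_of(myvalues), ~ .x / total, .names = "{.col}_percent"))
res
```

all_of() raises an error if any name in myvalues is missing from the data, which is usually what you want when the list of columns is built programmatically.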
I prefer mutate(across(...)) in this case. To make your idea work, try reduce2() from purrr.
library(dplyr)
library(purrr)
reduce2(mynames, myvalues,
~ mutate(..1, !!..2 := !!sym(..3)/total), .init = mytib)
# # A tibble: 2 x 6
# id value1 value2 total value1_percent value2_percent
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 4 42 2 2 21
# 2 2 6 5 2 3 2.5
The above code is actually shorthand for:
mytib %>%
mutate(!!mynames[1] := !!sym(myvalues[1])/total,
!!mynames[2] := !!sym(myvalues[2])/total)
I have a list of data frames each of which contains multiple variables that contain surface area values (ending in "_area"). For each surface area variable there is corresponding conversion factor (ending in “_unit”) that I want to use to calculate a third variable that contains the area in a standard unit of measurement. I want these variables to end in “_area_ha”.
Below are my sample data frames:
a <- tibble(a1_area = c(1,1,1), a2_area_unit = c(1,1,0.5), a2_area = c(1,1,1),
a1_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
b <- tibble(b1_area = c(1,1,1), b1_area_unit = c(1,1,0.5), b2_area = c(1,1,1),
b2_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
ab_list <- list(a, b)
names(ab_list) <- c("a", "b")
I know how to do to this with the help of a loop but would like to understand how this can be done in the tidyverse/dplyr logic. My loop (which gives me the desired output) looks like this:
df_names <- names(ab_list)
for (d in df_names) {
df <- ab_list[[d]]
var_names <- names(select(df, matches("_area$")))
for (v in var_names) {
int <- df %>% select(all_of(v),)
int2 <- df %>% select(matches(paste0(names(int), "_unit")))
int3 <- int*int2
names(int3) <- paste0(names(int), "_ha")
df <- cbind(df, int3)
rm(int, int2, int3)
}
ab_list[[d]] <- tibble(df)
rm(df)
}
> ab_list
$`a`
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
I have tried using lapply and mutate_at but my approach does not work. If I understand correctly, this is because my environment is nested and I cannot access x in the function that calculates the variable "ha".
ab_list %>%
lapply(function(x) mutate_at(x, vars(matches("_area$")), list(ha = ~.*x[[paste0(names(.),"_unit")]])))
Error: Column `a1_area_ha` must be length 3 (the number of rows) or one, not 0
Is there a way to get the function within mutate_at to access a variable from the parent data frame based on the name of initial variable within the function?
I would of course be happy about any other suggestion for a tidyverse approach to calculate the "_ha" variables based on dynamic variable names.
Great question. Below is a base R solution. I am sure it can be adapted to a tidyverse solution (e.g., with purrr::map2()). Here I built a function that does a basic test and then used it with lapply(). Note: the answer is tailored for your example, so you'll need to adapt it if you have different column names for the value / units. Hope this helps!!
val_by_unit <- function(data) {
df <- data[order(names(data))]
# Selecting columns for values and units
val <- df[endsWith(names(df), "area")]
unit <- df[endsWith(names(df), "unit")]
# Check names are multiplying correctly
if(!all(names(val) == sub("_unit", "", names(unit)))) {
stop("Not all areas have a corresponding unit")
}
# Multiplying corresponding columns
output <- Map(`*`, val, unit)
# Renaming output and adding columns
data[paste0(names(output), "_ha")] <- output
data
}
Results:
lapply(ab_list, val_by_unit)
$a
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
The tidyverse functions work best with 'long' formatted data where each of your rows represents a unique data point. To do this, you will want to use the tidyr::pivot_longer function:
# Reshape one data frame so that each row is a single site measurement
to_long <- function(df) {
  df %>%
    # Shorten "<site>_area_unit" to "<site>_unit" so every name is "<site>_<variable>"
    dplyr::rename_with(~ sub("_area_unit$", "_unit", .x)) %>%
    # Convert the area and unit columns to long format in one step
    tidyr::pivot_longer(
      cols = dplyr::matches("_(area|unit)$"),
      names_to = c("site", ".value"),
      names_pattern = "^(.+)_(area|unit)$"
    )
}
# Reshape each data frame, then stack them
lapply(ab_list, to_long) %>%
  dplyr::bind_rows() %>%
  # Calculate area in HA
  dplyr::mutate(
    area_ha = area * unit
  )
Once your data is in long format, you can just use dplyr::mutate to multiply your area column by the unit column to get an area_ha column. If you want to convert your data back to its original format, you can use tidyr::pivot_wider to convert the data back to a wide format, which would give you columns with names a1_area_ha, a2_area_ha, etc.
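If you would rather keep the wide layout, one possible dplyr sketch loops over the "_area" columns and builds each new column with a glue-style name and the .data pronoun. The helper name add_ha is made up for this illustration, and the code assumes every "_area" column has a matching "_area_unit" column.

```r
library(dplyr)

a <- tibble(a1_area = c(1, 1, 1), a2_area_unit = c(1, 1, 0.5), a2_area = c(1, 1, 1),
            a1_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
b <- tibble(b1_area = c(1, 1, 1), b1_area_unit = c(1, 1, 0.5), b2_area = c(1, 1, 1),
            b2_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
ab_list <- list(a = a, b = b)

# add_ha() is a hypothetical helper, not part of any package
add_ha <- function(df) {
  # Columns whose names end in "_area" (the "_area_unit" columns are excluded)
  area_cols <- grep("_area$", names(df), value = TRUE)
  for (v in area_cols) {
    # "{v}_ha" names the new column; .data[[...]] looks columns up by string
    df <- df %>% mutate("{v}_ha" := .data[[v]] * .data[[paste0(v, "_unit")]])
  }
  df
}

res <- lapply(ab_list, add_ha)
res
```

The loop is still explicit, but each step stays inside mutate(), so the column lookup by name goes through dplyr rather than through cbind() and temporary objects.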
I can't find an exact answer to this problem, so I hope I'm not duplicating a question.
I have a dataframe as follows
groupid col1 col2 col3 col4
1 0 n NA 2
1 NA NA 2 2
What I'm trying to convey with this is that there are duplicate IDs where the total information is spread across both rows and I want to combine these rows to get all the information into one row. How do I go about this?
I've tried to play around with group_by and paste but that ends up making the data messier (getting 22 instead of 2 in col4 for example) and sum() does not work because some columns are strings and those that are not are categorical variables and summing them would change the information.
Is there something I can do to collapse the rows and leave consistent data unchanged while filling in NAs?
EDIT:
Sorry desired output is as follows:
groupid col1 col2 col3 col4
1 0 n 2 2
Is this what you want? It uses zoo together with dplyr:
df %>%
group_by(groupid) %>%
mutate_all(~ na.locf(., na.rm = FALSE, fromLast = FALSE)) %>%
filter(row_number() == n())
# A tibble: 1 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n 2 2
EDIT1
Without the filter, this returns the whole data frame:
df %>%
group_by(groupid) %>%
mutate_all(~ na.locf(., na.rm = FALSE, fromLast = FALSE))
# A tibble: 2 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n NA 2
2 1 0 n 2 2
The filter simply keeps the last row of each group. na.locf carries the previous non-NA value forward, which means the last row in each group is the one you want.
Also, based on @thelatemail's recommendation, you can do the following, which gives the same answer:
df %>% group_by(groupid) %>% summarise_all(~ .[!is.na(.)][1])
EDIT2
Assuming you have conflicting values and want to show them all:
df <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 1 NA 2 2",
header=TRUE,stringsAsFactors=FALSE)
df
groupid col1 col2 col3 col4
1 1 0 n NA 2
2 1 1(#)<NA> 2 2(#)
df %>%
group_by(groupid) %>%
summarise_all(~ toString(unique(na.omit(.)))) # unique() collapses duplicates such as those in col4
groupid col1 col2 col3 col4
<int> <chr> <chr> <chr> <chr>
1 1 0, 1 n 2 2
Another option with just dplyr is just to take the first non-NA value when available. You can do
dd <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 NA NA 2 2", header=T)
dd %>%
group_by(groupid) %>%
summarise_all(~first(na.omit(.)))
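For readers on a recent dplyr, the same "first non-NA value per group" idea can also be sketched with summarise() and across(), which the package documentation recommends over the scoped _all verbs. The data frame is rebuilt with data.frame() here only so that the snippet is self-contained.

```r
library(dplyr)

dd <- data.frame(groupid = c(1, 1), col1 = c(0, NA), col2 = c("n", NA),
                 col3 = c(NA, 2), col4 = c(2, 2))

# For every non-grouping column, drop the NAs and keep the first value left;
# indexing with [1] gives NA if a group has no non-missing value at all
res <- dd %>%
  group_by(groupid) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))
res
```

Note that this keeps only the first value when a group contains conflicting non-NA entries, so it is best suited to data where the rows are known to agree.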
Would you be able to show the desired output in this case? Converting the data.frame into another type with as.vector() or as.matrix(), and grouping/factoring, might help.
UPDATE:
Find the unique elements of each column, omitting NAs:
df<-data.frame(groupid=c(1,1), col1=c(0,NA), col2=c('n', NA), col3=c(NA,2), col4=c(2,2)) # your input
out<-data.frame(df[1,]) # where the output is stored, duplicate retaining 1 row
for(i in 1:ncol(df)) out[,i]<-na.omit(unique(df[,i]))
print(out)
I am having trouble using the tidyr::complete() function with column names as variables.
The built-in example works as expected:
df <- data_frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
df %>% complete(group, nesting(item_id, item_name))
However, when I try to provide the column names as character strings, it produces an error.
gr="group"
id="item_id"
name="item_name"
df %>% complete_(gr, nesting_(id, name),fill = list(NA))
Even a little more simply, df %>% complete(!!!syms(gr), nesting(!!!syms(id), !!!syms(name))) now gets it done in tidyr 1.0.2
I think it's a bug that complete_ can't work with data.frames or list columns like complete can, but here's a workaround using unite_ and separate to simulate nesting:
df %>% unite_('id_name', c(id, name)) %>%
complete_(c(gr, 'id_name')) %>%
separate(id_name, c(id, name))
## # A tibble: 4 × 5
## group item_id item_name value1 value2
## * <dbl> <chr> <chr> <int> <int>
## 1 1 1 a 1 4
## 2 1 2 b 3 6
## 3 2 1 a NA NA
## 4 2 2 b 2 5
Now that tidyr has adopted tidy evaluation, the underscore variants (i.e. complete_) have been deprecated since their behavior can be handled by the standard variants (complete).
However, complete, crossing and nesting use data-masking, so the way to convert variables into names is via the .data[[var]] pronoun (per the docs), so your case becomes:
suppressPackageStartupMessages(
library(tidyr)
)
df <- data.frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
gr <- "group"
id <- "item_id"
name <- "item_name"
df %>% complete(
.data[[gr]],
nesting(.data[[id]],
.data[[name]])
)
#> # A tibble: 4 x 5
#> group item_id item_name value1 value2
#> <dbl> <dbl> <fct> <int> <int>
#> 1 1 1 a 1 4
#> 2 1 2 b 3 6
#> 3 2 1 a NA NA
#> 4 2 2 b 2 5
Created on 2020-02-28 by the reprex package (v0.3.0)
Not very elegant, but it gets the job done.