I have a list of data frames, each of which contains multiple variables holding surface area values (names ending in "_area"). For each surface area variable there is a corresponding conversion factor (ending in "_unit") that I want to use to calculate a third variable containing the area in a standard unit of measurement. I want these variables to end in "_area_ha".
Below are my sample data frames:
a <- tibble(a1_area = c(1,1,1), a2_area_unit = c(1,1,0.5), a2_area = c(1,1,1),
a1_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
b <- tibble(b1_area = c(1,1,1), b1_area_unit = c(1,1,0.5), b2_area = c(1,1,1),
b2_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
ab_list <- list(a, b)
names(ab_list) <- c("a", "b")
I know how to do this with a loop but would like to understand how it can be done in the tidyverse/dplyr logic. My loop (which gives me the desired output) looks like this:
df_names <- names(ab_list)
for (d in df_names) {
df <- ab_list[[d]]
var_names <- names(select(df, matches("_area$")))
for (v in var_names) {
int <- df %>% select(all_of(v),)
int2 <- df %>% select(matches(paste0(names(int), "_unit")))
int3 <- int*int2
names(int3) <- paste0(names(int), "_ha")
df <- cbind(df, int3)
rm(int, int2, int3)
}
ab_list[[d]] <- tibble(df)
rm(df)
}
> ab_list
$`a`
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
I have tried using lapply and mutate_at but my approach does not work. If I understand correctly, this is because my environment is nested and I cannot access x in the function that calculates the variable "ha".
ab_list %>%
lapply(function(x) mutate_at(x, vars(matches("_area$")), list(ha = ~.*x[[paste0(names(.),"_unit")]])))
Error: Column `a1_area_ha` must be length 3 (the number of rows) or one, not 0
Is there a way to get the function within mutate_at to access a variable from the parent data frame based on the name of initial variable within the function?
I would of course be happy about any other suggestion for a tidyverse approach to calculate the "_ha" variables based on dynamic variable names.
Great question. Below is a base R solution. I am sure it can be adapted to a tidyverse solution (e.g., with purrr::map2()). Here I built a function that does a basic test and then used it with lapply(). Note: the answer is tailored for your example, so you'll need to adapt it if you have different column names for the value / units. Hope this helps!!
val_by_unit <- function(data) {
df <- data[order(names(data))]
# Selecting columns for values and units
val <- df[endsWith(names(df), "area")]
unit <- df[endsWith(names(df), "unit")]
# Check names are multiplying correctly
if(!all(names(val) == sub("_unit", "", names(unit)))) {
stop("Not all areas have a corresponding unit")
}
# Multiplying corresponding columns
output <- Map(`*`, val, unit)
# Renaming output and adding columns
data[paste0(names(output), "_ha")] <- output
data
}
Results:
lapply(ab_list, val_by_unit)
$a
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
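For a tidyverse counterpart, here is a minimal sketch using purrr::map() with dplyr::across() and cur_column(). It assumes dplyr 1.0.0 or later and that every "_area" column has a matching "_area_unit" column; the helper name add_area_ha is made up for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

a <- tibble(a1_area = c(1, 1, 1), a2_area_unit = c(1, 1, 0.5), a2_area = c(1, 1, 1),
            a1_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
b <- tibble(b1_area = c(1, 1, 1), b1_area_unit = c(1, 1, 0.5), b2_area = c(1, 1, 1),
            b2_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
ab_list <- list(a = a, b = b)

# Hypothetical helper: multiply each *_area column by its *_area_unit partner
add_area_ha <- function(df) {
  mutate(df, across(
    matches("_area$"),
    # cur_column() gives the name of the column currently being processed;
    # get() then looks up the matching unit column by name
    ~ .x * get(paste0(cur_column(), "_unit")),
    .names = "{.col}_ha"
  ))
}

result <- map(ab_list, add_area_ha)
```

Because the unit column is looked up by name, the order of the columns in each data frame does not matter.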
The tidyverse functions work best with 'long' formatted data where each of your rows represents a unique data point. To do this, you will want to use the tidyr::pivot_longer function:
# Join dataframes (dropping the duplicated abc column from b)
dplyr::bind_cols(a, dplyr::select(b, -abc)) %>%
  # Convert area and unit columns to long format in one step; the '.value'
  # sentinel keeps each area next to its own unit instead of pairing
  # every area with every unit
  tidyr::pivot_longer(
    cols = -abc,
    names_to = c('site', '.value'),
    names_pattern = '^([^_]+)_(.*)$'
  ) %>%
  # Use a shorter name for the conversion factor column
  dplyr::rename(unit = area_unit) %>%
  # Calculate area in HA
  dplyr::mutate(
    area_ha = area * unit
  )
Once your data is in long format, you can just use dplyr::mutate to multiply your area column by the unit column to get an area_ha column. If you want to convert your data back to its original format, you can use tidyr::pivot_wider to convert the data back to a wide format, which would give you columns with names a1_area_ha, a2_area_ha, etc.
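A minimal sketch of the round trip for a single data frame, assuming column names of the form "<site>_area" and "<site>_area_unit" and that abc identifies each row. It uses the ".value" sentinel of pivot_longer() so that each area stays paired with its own unit, and names_glue in pivot_wider() to rebuild the original naming scheme.

```r
library(dplyr)
library(tidyr)
library(tibble)

a <- tibble(a1_area = c(1, 1, 1), a2_area_unit = c(1, 1, 0.5), a2_area = c(1, 1, 1),
            a1_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))

# Long format: one row per (abc, site), with area and area_unit side by side
long <- a %>%
  pivot_longer(
    cols = -abc,
    names_to = c("site", ".value"),
    names_pattern = "^([^_]+)_(.*)$"
  ) %>%
  mutate(area_ha = area * area_unit)

# Back to wide format: columns such as a1_area, a1_area_unit, a1_area_ha
wide <- long %>%
  pivot_wider(
    names_from = site,
    values_from = c(area, area_unit, area_ha),
    names_glue = "{site}_{.value}"
  )
```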
I want to rearrange a dataframe so that 1) the first column always stays first, and 2) the remaining columns are interleaved, with each column from the second half placed directly after its counterpart from the first half.
Note that in the example below
Please see example data:
# Example data
N <- "AB"
l_x_1 <- 1
l_x_2 <- 2
l_x_3 <- 3
# ... note: it should be able to handle a different number of columns
s_x_1 <- 1
s_x_2 <- 2
s_x_3 <- 3
# ... note: it should be able to handle a different number of columns (although always an equal number of s_ and l_ columns).
# Current state
df <- tibble(N, l_x_1, l_x_2, l_x_3, s_x_1, s_x_2, s_x_3)
df
# What I want (but potential to handle different number of As and Bs)
df <- tibble(N, l_x_1, s_x_1, l_x_2, s_x_2, l_x_3, s_x_3)
df
df[c(1, matrix(seq_along(df)[-1], 2, byrow = TRUE))]
# # A tibble: 1 × 7
# N l_x_1 s_x_1 l_x_2 s_x_2 l_x_3 s_x_3
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 AB 1 1 2 2 3 3
The idea is to reorder column indices with matrix(). E.g.
c(matrix(1:6, 2, byrow = TRUE))
# [1] 1 4 2 5 3 6
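The same index trick can be wrapped in a small helper so it works for any number of column pairs. This is only a sketch: the name interleave_cols is made up, and it assumes the first column is the identifier and the remaining columns split into two equal halves.

```r
library(tibble)

df <- tibble(N = "AB", l_x_1 = 1, l_x_2 = 2, l_x_3 = 3,
             s_x_1 = 1, s_x_2 = 2, s_x_3 = 3)

# Hypothetical helper: keep column 1 first, then alternate between the
# first and second half of the remaining columns
interleave_cols <- function(df) {
  rest <- seq_along(df)[-1]
  stopifnot(length(rest) %% 2 == 0)  # the two halves must have equal size
  # Filling a 2-row matrix by row and reading it back by column
  # alternates between the two halves
  df[c(1, matrix(rest, nrow = 2, byrow = TRUE))]
}

reordered <- interleave_cols(df)
```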
A solution based on stringi::stri_reverse:
cbind(df[1], df[-1][order(stringi::stri_reverse(colnames(df[-1])))])
N l_x_1 s_x_1 l_x_2 s_x_2 l_x_3 s_x_3
1 AB 1 1 2 2 3 3
This relies on the fact that if you reverse every name, sorting the reversed names gives the desired order:
> sort(stringi::stri_reverse(colnames(df[-1])))
[1] "1_x_l" "1_x_s" "2_x_l" "2_x_s" "3_x_l" "3_x_s"
library(tidyverse)
(preferred_order <- str_split(names(df)[-1],"_",
n = 3, # because 2 underscores
simplify = TRUE) |>
data.frame() |>
arrange(X3,X1) |> # also because 2 underscores
rowwise() |>
mutate(nm=paste0(c_across(),
collapse = "_")) |> pull(nm))
relocate(df,
all_of(preferred_order),.after = "N")
I'd like to create several new columns. They should take their names from one vector and they should be computed by taking one column in the data and dividing it by another.
mytib <- tibble(id = 1:2, value1 = c(4,6), value2 = c(42, 5), total = c(2,2))
myvalues <- c("value1", "value2")
mynames <- c("value1_percent", "value2_percent")
mytib %>%
mutate({{ mynames }} := {{ myvalues }}/total)
Here, I get the following error message, which makes me think that the curly-curly operator is misplaced:
Error in local_error_context(dots = dots, .index = i, mask = mask) : promise already under evaluation: recursive default argument reference or earlier problems?
I'd like to calculate the percentage columns programmatically (since I have many such columns in my data).
The desired output should be equivalent to this:
mytib %>%
mutate( "value1_percent" = value1/total, "value2_percent" = value2/total)
which gives
# A tibble: 2 × 6
id value1 value2 total value1_percent value2_percent
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 42 2 2 21
2 2 6 5 2 3 2.5
You could use across and construct the new names in its .names argument:
library(dplyr)
mytib %>%
mutate(across(starts_with('value'),
~ .x / total,
.names = "{.col}_percent"
))
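If the columns to transform are already stored in a character vector, as myvalues is in the question, all_of() can stand in for starts_with(). A small sketch of that variant:

```r
library(dplyr)
library(tibble)

mytib <- tibble(id = 1:2, value1 = c(4, 6), value2 = c(42, 5), total = c(2, 2))
myvalues <- c("value1", "value2")

result <- mytib %>%
  mutate(across(all_of(myvalues),       # select columns named in the vector
                ~ .x / total,           # divide each by the total column
                .names = "{.col}_percent"))
```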
I prefer mutate(across(...)) in this case. To make your idea work, try reduce2() from purrr.
library(dplyr)
library(purrr)
reduce2(mynames, myvalues,
~ mutate(..1, !!..2 := !!sym(..3)/total), .init = mytib)
# # A tibble: 2 x 6
# id value1 value2 total value1_percent value2_percent
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 4 42 2 2 21
# 2 2 6 5 2 3 2.5
The above code is actually a shortcut of:
mytib %>%
mutate(!!mynames[1] := !!sym(myvalues[1])/total,
!!mynames[2] := !!sym(myvalues[2])/total)
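The same expansion can also be built explicitly, as a named list of expressions that is spliced into a single mutate() call with !!!. This is a sketch of an alternative, not part of the answer above; percent_exprs is an illustrative name.

```r
library(dplyr)
library(rlang)
library(tibble)

mytib <- tibble(id = 1:2, value1 = c(4, 6), value2 = c(42, 5), total = c(2, 2))
myvalues <- c("value1", "value2")
mynames <- c("value1_percent", "value2_percent")

# One unevaluated expression per input column, e.g. value1 / total
percent_exprs <- lapply(myvalues, function(v) expr(!!sym(v) / total))
# The names of the list become the names of the new columns
names(percent_exprs) <- mynames

result <- mutate(mytib, !!!percent_exprs)
```

Unlike the across() version, this lets the new column names be arbitrary rather than derived from the input names.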
I have this kinda simple task that I'm having a hard time looping over.
So, let's assume I have this tibble:
library(tidyverse)
dat <- tibble(player1 = c("aa","bb","cc"), player2 = c("cc","aa","bb"))
My goal here, is to make three new columns ( for each unique "player" I have) and assign value of 1 to the column, if the player is "player1", -1 if the player is "player2" and 0 otherwise.
Previously, I have been doing it like this:
dat %>% mutate( aa = ifelse(player1 == "aa",1,ifelse(player2 == "aa",-1,0)),
bb = ifelse(player1 == "bb",1,ifelse(player2 == "bb",-1,0)),
cc = ifelse(player1 == "cc",1,ifelse(player2 == "cc",-1,0)))
This works, but now I have hundreds of different "players", so it would seem silly to do this manually like that. I have tried and read about loops in R, but I just can't get this one right.
Using model.matrix() from base R:
dat[unique(dat$player1)] <-
model.matrix(~0+ player1, data = dat) - model.matrix(~0+ player2, data = dat)
dat
player1 player2 aa bb cc
<chr> <chr> <dbl> <dbl> <dbl>
1 aa cc 1 0 -1
2 bb aa -1 1 0
3 cc bb 0 -1 1
This assumes you have all players in both columns. Otherwise you would need to convert them to factors with the appropriate levels and replace unique with levels.
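A sketch of that factor-based variant, using made-up data in which player "dd" appears only in player2. Both columns are converted to factors sharing one set of levels, so the two model matrices have the same columns in the same order.

```r
dat <- data.frame(player1 = c("aa", "bb", "cc"),
                  player2 = c("cc", "aa", "dd"))

# Shared levels: every player that appears in either column
players <- sort(unique(c(dat$player1, dat$player2)))
p1 <- factor(dat$player1, levels = players)
p2 <- factor(dat$player2, levels = players)

# One indicator column per level for each role; the difference is
# 1 for player1, -1 for player2 and 0 otherwise
scores <- model.matrix(~ 0 + p1, data = data.frame(p1 = p1)) -
  model.matrix(~ 0 + p2, data = data.frame(p2 = p2))
colnames(scores) <- players

result <- cbind(dat, scores)
```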
We could go from initial structure to "long"(-er) format, with one row per (game, player), recode to 1/-1, and then go wide again with the desired output:
dat %>%
mutate(game_id = row_number()) %>%
gather("role", "player", -game_id) %>%
mutate(role = recode(role, "player1" = 1L, "player2" = -1L)) %>%
spread(player, role, fill = 0L)
#> # A tibble: 3 x 4
#> game_id aa bb cc
#> <int> <int> <int> <int>
#> 1 1 1 0 -1
#> 2 2 -1 1 0
#> 3 3 0 -1 1
You can use pivot_longer() to stack those columns starting with "player" and then pivot it to wide. The advantage is that you can do recoding within pivot_wider() by the argument values_fn.
library(tidyverse)
dat %>%
rowid_to_column("id") %>%
pivot_longer(starts_with("player")) %>%
pivot_wider(names_from = value, names_sort = TRUE,
values_from = name, values_fill = 0,
values_fn = function(x) c(1, -1)[match(x, c("player1", "player2"))])
# # A tibble: 3 x 4
# id aa bb cc
# <int> <dbl> <dbl> <dbl>
# 1 1 1 0 -1
# 2 2 -1 1 0
# 3 3 0 -1 1
Note: Development on gather()/spread() is complete, and for new code we recommend switching to pivot_longer()/_wider(), which is easier to use, more featureful, and still under active development.
I'm trying to calculate row means on a dataset. I found a helpful function someone made here (dplyr - using mutate() like rowmeans()), and it works when I try out every column but not when I try to use a dplyr helper function.
Why does this work:
#The rowmeans function that works
my_rowmeans = function(..., na.rm=TRUE){
x =
if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
else list(...)
d = Reduce(function(x,y) x+!is.na(y), list(...), init=0)
Reduce(`+`, x)/d
}
#The data
library(tidyverse)
data <- tibble(id = c(1:4),
turn_intent_1 = c(5, 1, 1, 4),
turn_intent_2 = c(5, 1, 1, 3),
turn_intent_3R = c(5, 5, 1, 3))
#The code that is cumbersome but works
data %>%
mutate(turn_intent_agg = my_rowmeans(turn_intent_1, turn_intent_2, turn_intent_3R))
#The output
# A tibble: 4 x 5
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
But this does not work:
#The code
data %>%
mutate(turn_intent_agg = select(., contains("turn")) %>%
my_rowmeans())
#The output
Error in class1Def#contains[[class2]] : no such index at level 1
Of course, I can type each column, but this dataset has many columns. It'd be much easier to use these wrappers.
I need the output to look like the correct one shown that contains all columns (such as id).
Thank you!
I think that you can simplify it to:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn"))))
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
And you can indeed add also the na.rm = TRUE parameter:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn")), na.rm = TRUE))
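As for why the select() version fails: my_rowmeans() expects each column as a separate argument, whereas the pipe hands it one whole data frame as a single argument. A sketch (not part of the answer above) that keeps the custom function and spreads the selected columns into separate arguments with do.call():

```r
library(dplyr)
library(tibble)

# The row-means function from the question, lightly reformatted
my_rowmeans <- function(..., na.rm = TRUE) {
  cols <- list(...)
  x <- if (na.rm) {
    lapply(cols, function(col) replace(col, is.na(col), as(0, class(col))))
  } else {
    cols
  }
  # Number of non-missing values per row
  d <- Reduce(function(acc, col) acc + !is.na(col), cols, 0)
  Reduce(`+`, x) / d
}

data <- tibble(id = 1:4,
               turn_intent_1 = c(5, 1, 1, 4),
               turn_intent_2 = c(5, 1, 1, 3),
               turn_intent_3R = c(5, 5, 1, 3))

# A data frame is a list of columns, so do.call() passes each column
# to my_rowmeans() as its own argument
turn_cols <- select(data, contains("turn"))
result <- mutate(data, turn_intent_agg = do.call(my_rowmeans, as.list(turn_cols)))
```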
I've read the introduction to R's dplyr programming (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html), which is very useful.
I often build quite complex functions which include several sets of grouping variables. For example, given a dataset df, I may want the function to summarise by some variables (let's say grouping variables G1 and G2) and then summarise by some others (let's say G3), and I'll then use these summaries together to produce some final result
df <- data.frame(xV = 1:3,yV=0:2, G1 =c(1,1,0),G2=c(0,0,1),G3=c(1,1,1))
#Within my function I want to calculate
#a)
df%>%group_by(G1,G2)%>%summarise(MEANS1= mean(xV,na.rm=T))
#As well as (b)
df%>%group_by(G3)%>%summarise(MEAN2= mean(xV,na.rm=T))
If I only had to do the first grouping (i.e. (a)) I can build a function, using ...
TAB2<-function(data,x,...){
require(dplyr)
x<-enquo (x)
groupSet1 <- enquos(...)
data%>%group_by(!!!(groupSet1))%>%
summarise(MEAN=mean(!!x,na.rm=T))
}
#Which gives me my results
TAB2(data=df,x=xV,G1,G2)
# A tibble: 2 x 3
# Groups: G1 [2]
G1 G2 MEAN
<dbl> <dbl> <dbl>
1 0 1 3
2 1 0 1.5
But if I want to do both (a) and (b) I need in some way to distinguish between the first and second set of grouping variables (G1, G2) and G3 respectively. I can't do it by just chucking the grouping variables after all the other inputs. Is there any way I can specify these two sets in the input, something along the lines of
TAB3<-function(data,x,y, GroupSet1=c(G1,G2) and GroupSet2=(G3)){
x<-enquo (x)
y<-enquo (y)
#a)
df%>%group_by(GroupSet1)%>%summarise(MEANS1= mean(!!x,na.rm=T))
#As well as (b)
df%>%group_by(GroupSet2)%>%summarise(MEAN2= mean(!!y,na.rm=T))
}
I have tried to "quote" the two sets in a similar way to x<-enquo(x) in a range of ways but I always get an error. Could you please help? If it was also possible to pass a list of variables as x and y to summarise_at it would also make the function as generic as possible, which would be even better. Basically I'm trying to create a template function that can take several variable sets x and y as well as several group sets, with the aim to produce the mean of the variables in the sets x and y by the corresponding group sets (G1, G2 and G3 respectively).
You can try
TAB3<-function(data, y, grouping_list){
require(tidyverse)
map(grouping_list, ~group_by_at(data, .) %>%
summarise_at(y, list(Mean= mean), na.rm=T)) }
TAB3(df, "xV", list(c("G1", "G2"), c("G3")))
[[1]]
# A tibble: 2 x 3
# Groups: G1 [2]
G1 G2 Mean
<dbl> <dbl> <dbl>
1 0 1 3
2 1 0 1.5
[[2]]
# A tibble: 1 x 2
G3 Mean
<dbl> <dbl>
1 1 2
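group_by_at() and summarise_at() are superseded in dplyr 1.0.0 and later. A sketch of the same idea written with across(), assuming the variables and grouping sets are passed as character vectors; the function name tab_by_groups is made up for illustration.

```r
library(dplyr)
library(purrr)

df <- data.frame(xV = 1:3, yV = 0:2, G1 = c(1, 1, 0), G2 = c(0, 0, 1), G3 = c(1, 1, 1))

# Hypothetical helper: one summary table per grouping set
tab_by_groups <- function(data, vars, grouping_list) {
  map(grouping_list, function(groups) {
    data %>%
      group_by(across(all_of(groups))) %>%
      summarise(
        # Mean of every requested variable, named <variable>_mean
        across(all_of(vars), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"),
        .groups = "drop"
      )
  })
}

res <- tab_by_groups(df, c("xV", "yV"), list(c("G1", "G2"), "G3"))
```

Passing a vector of variable names gives the several-variables-per-call behaviour asked for in the question, with one result table per grouping set.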
If you wanted to use the ellipsis as per your TAB2 example, you could try:
update based on new info:
TAB3<-function(df,x,...){
args <- substitute(list(...))
names_env <- setNames(as.list(names(df)), names(df))
arg_list <- eval(args, names_env)
out <- vector(mode = "list", length(arg_list))
for(i in seq_along(arg_list)){
out[[i]] <- df %>% group_by(!!!syms(arg_list[[i]])) %>%
summarise_at(vars(!!!enquos(x)) ,.funs = list(mean=mean), na.rm = T)
}
out
}
TAB3(df, x = c(xV,yV), GroupSet1=c(G1,G2), GroupSet2=G3)
#[[1]]
# A tibble: 2 x 4
# Groups: G1 [2]
# G1 G2 xV_mean yV_mean
# <dbl> <dbl> <dbl> <dbl>
#1 0 1 3 2
#2 1 0 1.5 0.5
#[[2]]
# A tibble: 1 x 3
# G3 xV_mean yV_mean
# <dbl> <dbl> <dbl>
#1 1 2 1