Mutate to modify values and replace - r

Hi there, I am trying to mutate values (e.g. converting kilograms to tonnes) and replace them in the original dataset, but the changes don't seem to persist in the original dataset.
Here is a sample dataset for reference.
Country  Type       Quantity
A        Kilograms  23132
B        Kilograms  34235
C        Tonnes     700
library(dplyr)
df %>%
filter(Type == "Kilograms") %>%
group_by(Quantity) %>%
mutate(Quantity = Quantity /1000)
But I am not sure what to do for the next step. I tried the replace function, but it didn't work.
Also, I plan to add a line at the end that changes all kilograms to tonnes, something like this:
df$Type[df$Type == 'Kilograms'] <- 'Tonnes'

You can also use case_when(), which is dplyr's equivalent of SQL's CASE WHEN. It lets you vectorize multiple if_else() statements. Below, the first condition acts as the if branch and TRUE ~ acts as the else branch. Quantity is converted before Type is overwritten, because mutate() evaluates its arguments in order.
library(dplyr)
df <- data.frame(Country = c('A', 'B', 'C'),
                 Type = c("Kilograms", "Kilograms", "Tonnes"),
                 Quantity = c(23132, 34235, 700))
df <- df %>%
  mutate(Quantity = case_when(Type == 'Kilograms' ~ Quantity / 1000,
                              TRUE ~ Quantity),
         # keep any other unit unchanged instead of relabelling everything
         Type = case_when(Type == 'Kilograms' ~ 'Tonnes',
                          TRUE ~ Type)
  )

Use the ifelse() function to change a value based on a condition in another column. It also works well within the tidyverse.
Don't forget to assign the result back to the original variable, since the pipe operator does not modify its input data.
library(dplyr)
df <- df %>%
  mutate(Quantity = ifelse(Type == "Kilograms", Quantity / 1000, Quantity),
         Type = ifelse(Type == "Kilograms", "Tonnes", Type))
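To make the point about reassignment concrete, here is a minimal sketch using the sample data from the question. The name converted is only illustrative, and if_else() is used in place of ifelse() as a stricter alternative; both are assumptions of this example rather than part of the answers above.

```r
library(dplyr)

df <- data.frame(Country = c("A", "B", "C"),
                 Type = c("Kilograms", "Kilograms", "Tonnes"),
                 Quantity = c(23132, 34235, 700),
                 stringsAsFactors = FALSE)

# Without assignment the result is only printed; df itself is unchanged.
df %>% mutate(Quantity = Quantity / 1000)

# Assigning the result keeps the conversion.
# Quantity is converted first, while Type still says "Kilograms".
converted <- df %>%
  mutate(Quantity = if_else(Type == "Kilograms", Quantity / 1000, Quantity),
         Type = if_else(Type == "Kilograms", "Tonnes", Type))
```

After this runs, df still holds the original kilogram values, while converted holds every quantity in tonnes.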

Related

Check if single column is equal to any multiple others

My question seems simple, but I just can't do it. I have a dataframe with multiple columns with the name starting with coa and another column p with values like A, D, F, and so on, which changes according to the id.
All I found is how to do this matching with a fixed value, let's say "A", as below:
df <- df %>%
  mutate(ly = any(str_detect(c_across(starts_with("coa")), "A")))
However, in my case, I want to compare to the column p specifically, where p changes, something like this:
df <- df %>%
  mutate(ly = any(str_detect(c_across(starts_with("coa")), p)))
In this case, I get the error:
x no applicable method for 'type' applied to an object of class "factor"
Any thoughts? Thanks!
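If we need to create a single logical column, use if_any(). Note that str_detect() does not accept a factor as the pattern, which is what the error above reports, so p is converted to character first.
library(dplyr)
library(stringr)
df <- df %>%
  mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, as.character(p))))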
I think this is a good place to use dplyr::across(). You can run vignette('colwise') for a more comprehensive guide, but the key point is that we can mutate all columns starting with "coa" at once, comparing each one to p. Older dplyr versions let you pass p to `==` through the ... argument of across(); that argument is deprecated as of dplyr 1.1.0, so a lambda is used here instead.
library(dplyr)
df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)
df %>%
  mutate(across(.cols = starts_with('coa'), .fns = ~ .x == p))
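Since the question did not include data, the following is a small sketch on made-up values. It assumes that an exact match against p is wanted (rather than a substring match via str_detect()) and that the columns are character; the column names and values are hypothetical.

```r
library(dplyr)

# Hypothetical data: two "coa" columns and a reference column p.
df <- tibble(id   = 1:3,
             coa1 = c("A", "B", "C"),
             coa2 = c("D", "D", "F"),
             p    = c("A", "D", "Z"))

# TRUE when at least one coa column equals p in the same row.
df <- df %>%
  mutate(ly = if_any(starts_with("coa"), ~ .x == p))
# ly is TRUE, TRUE, FALSE
```

Unlike the across() version, this leaves the coa columns untouched and adds one summary column.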

mutate cells of a range of columns if the column name is in another column

I have a huge dataset where I would like to change a cell value in a range of columns, if the column name is in another column.
I know I can loop through cells, and use ifelse, but this becomes very slow very soon, it seems. I got as far as using mutate() and across() but cannot work out how to make a logical with the column name.
I would be grateful if someone could suggest a vectorized approach, or point me to a similar question (which I was unable to find!), using tidyverse if possible.
Example of a dataset and the nested for loops:
a <- c(1,2,3,4)
b <- c(5,6,7,8)
c <- c(9,10,11,12)
d <- c("a","b","c","none")
test <- data.frame(a,c,b,d)
for(column in 1:3){
for(row in 1:nrow(test)){
test[row,column] <- ifelse(names(test)[column] == test$d[row], -99, test[row, column])
}
}
I found the solution to my own question: cur_column() gives the name of the current column inside across(), and it can be combined with ifelse().
test %>% mutate(across(c(a, b, c), ~ ifelse(cur_column() == d, -99, .x)))
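As a self-contained check, the sketch below rebuilds the example data frame and applies the cur_column() approach. The name result and the use of if_else() instead of ifelse() are choices made for this illustration.

```r
library(dplyr)

test <- data.frame(a = c(1, 2, 3, 4),
                   c = c(9, 10, 11, 12),
                   b = c(5, 6, 7, 8),
                   d = c("a", "b", "c", "none"),
                   stringsAsFactors = FALSE)

# For each of a, b, c: set the cell to -99 where d names that column.
result <- test %>%
  mutate(across(c(a, b, c), ~ if_else(cur_column() == d, -99, .x)))
# a: -99, 2, 3, 4   b: 5, -99, 7, 8   c: 9, 10, -99, 12
```

The comparison cur_column() == d is evaluated once per column over all rows, so no explicit loop over cells is needed.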
You could also do this explicitly for every column of interest, using d as the reference column.
library(tidyverse)
test %>%
  mutate(a = case_when(
    d == names(test)[1] ~ -99,
    TRUE ~ a
  ))
You could then add another mutate() call, or extend the same one, for each target column, e.g.:
test %>%
  mutate(a = case_when(
    d == names(test)[1] ~ -99,
    TRUE ~ a
  )) %>%
  mutate(b = case_when(
    d == names(test)[3] ~ -99,  # "b" is the third column of test (a, c, b, d)
    TRUE ~ b
  ))
If you have multiple source columns (i.e. columns like d), you would need to add conditions to each case_when() that account for them. Since your test data does not include that case, I won't go into it unless required.

How to combine mutate when the condition is the same but output is different?

For my data frame, if "mg" is found in the column unit, it is replaced with "g", and the corresponding value in the column mass is divided by 1000. I used mutate twice to achieve this. Is there any way to combine the two mutate calls into one?
df %>% mutate(unit = case_when(unit == "mg" ~ "g"))
df %>% mutate(mass = case_when(unit == "mg" ~ mass / 1000))
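We can include several transformations (here, both of them) inside the same mutate call, but in reversed order: mass has to be computed first, because mutate evaluates its arguments sequentially and the condition unit == 'mg' no longer matches once unit has been overwritten. If your case_when statement is that simple, ifelse is enough: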
library(dplyr)
df %>% mutate(mass = ifelse(unit == 'mg', mass / 1000, mass),
unit = ifelse(unit == 'mg', 'g', unit))
Create the logical condition as a temporary column and reuse it. As the replacement values are different, it is better to keep the two replacements separate. Also, case_when() by default sets all unmatched elements to NA. If the OP meant to keep the remaining values from the original column, specify a TRUE ~ fallback:
library(dplyr)
df <- df %>%
  mutate(i1 = unit == 'mg',
         unit = case_when(i1 ~ 'g', TRUE ~ unit),
         mass = case_when(i1 ~ mass / 1000, TRUE ~ mass),
         i1 = NULL)
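The effect of argument order inside a single mutate() can be seen in a short sketch. The data values and the names wrong and right are made up for illustration.

```r
library(dplyr)

# Hypothetical sample data
df <- tibble(unit = c("mg", "g", "mg"), mass = c(500, 2, 1500))

# unit is overwritten first, so the second condition never matches.
wrong <- df %>%
  mutate(unit = if_else(unit == "mg", "g", unit),
         mass = if_else(unit == "mg", mass / 1000, mass))
# wrong$mass is still 500, 2, 1500

# mass is converted while unit still says "mg".
right <- df %>%
  mutate(mass = if_else(unit == "mg", mass / 1000, mass),
         unit = if_else(unit == "mg", "g", unit))
# right$mass is 0.5, 2, 1.5
```

Storing the condition in a helper column, as in the answer above, avoids having to think about the order at all.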

R function with list of variables of unknown length

I am trying to branch out and learn some R. One thing I often do at my job is pull weighted means by some time-specific period variable. I figured out how to do that for individual variables like this:
means_by_period <- df %>%
group_by(period) %>%
summarize(var1 = weighted.mean(var1, wgtvar),
var2 = weighted.mean(var2, wgtvar),
var3 = weighted.mean(var3, wgtvar),
var4 = weighted.mean(var4, wgtvar)
)
We do this all the time, but I am not always going to know how many variables, or which variables, I will be pulling, and it would be a pain to edit this code every time. I built an Excel sheet to generate it for me, but this seems like a good opportunity to learn how to write a function instead. The problem is that I am not sure how to write it so that it works. I know my arguments will be: 1. the current data set, 2. the period, 3. the weight variable, 4. a concatenated vector of my variables?
newfunction <- function(df, period, weight, variables)
{df %>%
group_by(period) %>%
summarize(var1 = weighted.mean(var1, weight),
var2 = weighted.mean(var2, weight),
var3 = weighted.mean(var3, weight),
var4 = weighted.mean(var4, weight) )
}
I am like 2 weeks into learning so if anyone could give me some pointers on what I'd need to do here that would be great. Thanks!
If 'var1', 'var2', 'var3', 'var4' are passed as a vector of column names (as strings in 'variables'), then we can convert each one to a symbol and evaluate it (!!)
library(dplyr)
newfunction <- function(df, period, weight, variables) {
df %>%
group_by({{period}}) %>%
summarize(
!! variables[1] := weighted.mean( !! rlang::sym(variables[1]), {{weight}}),
!! variables[2] := weighted.mean( !! rlang::sym(variables[2]), {{weight}}),
!! variables[3] := weighted.mean( !! rlang::sym(variables[3]), {{weight}}),
!! variables[4] := weighted.mean( !! rlang::sym(variables[4]), {{weight}}) )
}
Here, the column names for 'period' and 'weight' are assumed to be passed unquoted, while 'variables' is a vector of strings.
As the OP mentioned that 'variables' can be of unknown length, we can loop over the vector of column names ('variables') with map, summarise each one separately, and join the results
library(purrr)
newfunction2 <- function(df, period, weight, variables) {
map(variables, ~ df %>%
group_by({{period}}) %>%
summarise(!! .x := weighted.mean(!! rlang::sym(.x), {{weight}}))) %>%
reduce(full_join)
}
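An alternative that avoids the join step is to let across() iterate over the columns. This is a sketch rather than the answerer's method: the function name weighted_means and the sample data are hypothetical, and it assumes a dplyr version that provides across() (1.0.0 or later).

```r
library(dplyr)

# period and weight are passed unquoted; variables is a character vector.
weighted_means <- function(df, period, weight, variables) {
  df %>%
    group_by({{ period }}) %>%
    summarise(across(all_of(variables),
                     ~ weighted.mean(.x, {{ weight }})),
              .groups = "drop")
}

# Hypothetical data
df <- tibble(period = c(1, 1, 2, 2),
             wgtvar = c(1, 3, 1, 1),
             var1   = c(10, 20, 30, 50),
             var2   = c(0, 4, 2, 4))

means_by_period <- weighted_means(df, period, wgtvar, c("var1", "var2"))
```

Because all_of() takes the names as strings, the same function works for any number of variables without editing its body.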

Dplyr Non Standard Evaluation -- Help Needed

I am making my first baby steps with non standard evaluation (NSE) in dplyr.
Consider the following snippet: it takes a tibble, sorts it according to the values inside a column and replaces the n-k lower values with "Other".
See for instance:
library(dplyr)
df <- cars %>% as_tibble()
k <- 3
df2 <- df %>%
arrange(desc(dist)) %>%
mutate(dist2 = factor(c(dist[1:k],
rep("Other", n() - k)),
levels = c(dist[1:k], "Other")))
What I would like is a function such that:
df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)
produces the same result, where old_column is "dist" (the column I use to sort the data set), new_column is "dist2" (the column I generate), and levels_to_keep is k (the number of values I explicitly retain).
I am getting lost in enquo, quo_name, etc.
Any suggestion is appreciated.
You can do:
library(dplyr)
sort_keep <- function(df, old_column, new_column, levels_to_keep) {
  old_column <- enquo(old_column)                      # capture the sorting column
  new_column <- as.character(substitute(new_column))   # new column name as a string
  df %>%
    arrange(desc(!!old_column)) %>%
    mutate(use = !!old_column,                         # temporary copy of the sorted column
           !!new_column := factor(c(use[1:levels_to_keep],
                                    rep("Other", n() - levels_to_keep)),
                                  levels = c(use[1:levels_to_keep], "Other")),
           use = NULL)                                 # drop the temporary column
}
df %>% sort_keep(dist, dist2, 3)
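The same function can probably be written more compactly with the embrace operator and glue-style names, assuming a reasonably recent rlang (0.4.0 or later). This is a sketch, not the answerer's code; wrapping the levels in unique() is an addition of mine to guard against ties among the top values.

```r
library(dplyr)

sort_keep <- function(df, old_column, new_column, levels_to_keep) {
  df %>%
    arrange(desc({{ old_column }})) %>%
    mutate("{{ new_column }}" := factor(
      # top values as text, everything below them collapsed to "Other"
      c(head(as.character({{ old_column }}), levels_to_keep),
        rep("Other", n() - levels_to_keep)),
      levels = c(unique(head(as.character({{ old_column }}), levels_to_keep)),
                 "Other")))
}

df2bis <- as_tibble(cars) %>% sort_keep(dist, dist2, 3)
```

Here {{ old_column }} replaces the enquo() and !! pair, and "{{ new_column }}" := builds the new column name from the unquoted argument.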
Something like this?
old_column = "dist"
new_column = "dist2"
levels_to_keep = 3
command = "df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)"
command = gsub('old_column', old_column, command)
command = gsub('new_column', new_column, command)
command = gsub('levels_to_keep', levels_to_keep, command)
eval(parse(text=command))

Resources