mutate_at - using a function with map2 - r

My objective is to convert a set of monthly revenue columns from AUD to USD. To achieve this, I need to apply a different exchange rate to each of the revenue columns.
Data for analysis:
pacman::p_load(lubridate, purrr, dplyr)
df1 <- data.frame(
Date = seq(dmy("01/01/2017"), by = "day", length.out = 3),
Customer = "a",
Product = "xxx",
Revenue1 = c(10, 20, 30),
Revenue2 = c(100, 200, 300))
df2 <- data.frame(Factor1 = c(10),
Factor2 = c(20))
df3 <- select(df1, Revenue1:Revenue2)
This is my function
fx_adjust <- function(x, y = df2){map2_df(x, y, ~ .x * .y)}
These two work:
fx_adjust(df3, df2)
mutate_at(df1, vars(contains("Revenue")), funs(. * 10))
But this does not work:
mutate_at(df1, vars(contains("Revenue")), funs(fx_adjust(.)))
Could someone kindly explain why mutate_at is misbehaving?

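This is because mutate_at calls your function separately for each column. The . stands for one column at a time, never for all the selected columns at once, so inside fx_adjust map2_df tries to pair the elements of a single column vector with the columns of df2.
Observe this example: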
fx_dump<-function(...) print(list(...))
mutate_at(df1, vars(contains("Revenue")), funs(fx_dump(.)))
You'll see that fx_dump is called twice, once for each column. You cannot pass multiple parameters at a time to your function using mutate_at.
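One way around this is to do the pairing outside mutate_at and assign the result back. This is a minimal sketch, assuming the columns of df2 are in the same order as the Revenue columns of df1. The name rev_cols is introduced here for illustration only.

```r
library(purrr)

df1 <- data.frame(
  Customer = "a",
  Revenue1 = c(10, 20, 30),
  Revenue2 = c(100, 200, 300))
df2 <- data.frame(Factor1 = 10, Factor2 = 20)

# Names of the columns to convert (hypothetical helper variable)
rev_cols <- grep("Revenue", names(df1), value = TRUE)

# map2 walks over the revenue columns and the factor columns in parallel,
# so Revenue1 is multiplied by Factor1 and Revenue2 by Factor2
df1[rev_cols] <- map2(df1[rev_cols], df2, ~ .x * .y)
```

Because map2 pairs by position, it is worth checking that rev_cols and names(df2) line up before relying on the result.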

Related

How to identify and remove outliers in a data.frame using R?

I have a data frame that has multiple outliers. I suspect that these outliers have produced different results than expected.
I tried to use this tip, but it didn't work, as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame.
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df<-identify_outliers(df$score)#identify outliers
df2<-df#copy df
df2<- df2[-which(df2$score %in% out_df),]#remove outliers from df2
View(df2)
identify_outliers expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expression (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
A rule of thumb is that data points above Q3 + 1.5xIQR or below Q1 - 1.5xIQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with the rstatix dependency, but with base R it can be achieved as in the example below:
# Generate demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) | demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))
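# remove them from your data frame (the guard matters: with no outliers, demo.data[-outliers,] would return zero rows)
df2 = if (length(outliers) > 0) demo.data[-outliers,] else demo.data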
A tidier option is to wrap this in a function that returns the indices of the outliers:
get_outliers = function(x){
which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = if (length(outliers) > 0) demo.data[-outliers,] else demo.data
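The same rule of thumb can also be written as a logical mask instead of an index vector. This is only a sketch: is_outlier is a hypothetical helper name, and it uses the default quantile type, so the fences may differ slightly from what other tools (box plots, for instance) compute.

```r
# TRUE for values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
is_outlier <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
# Here Q1 = 3.25 and Q3 = 7.75, so the fences are -3.5 and 14.5
flagged <- which(is_outlier(x))

# Subsetting with the negated mask keeps every row when nothing is flagged
df <- data.frame(sample = seq_along(x), score = x)
df_clean <- df[!is_outlier(df$score), ]
```

Negating a logical mask avoids the empty-index case, where indexing with -integer(0) returns no rows at all.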

Is there a way of using the pmap function in purrr to iteratively adjust a dataset?

I've created a function which I am trying to apply to a dataset using pmap. The function I've created amends some columns in a dataset. I want the amendment that's applied to the two columns to carry over to the 2nd and subsequent iterations of pmap.
Reproducible example below:
library(tidyr)
library(dplyr)
library(purrr)
set.seed(1982)
#create example dataset
dataset <- tibble(groupvar = sample(c(1:3), 20, replace = TRUE),
a = sample(c(1:10), 20, replace = TRUE),
b = sample(c(1:10), 20, replace = TRUE),
c = sample(c(1:10), 20, replace = TRUE),
d = sample(c(1:10), 20, replace = TRUE)) %>%
arrange(groupvar)
#function to sum 2 columns (col1 and col2), then adjust those columns such that the cumulative sum of the two columns
#within the group doesn't exceed the specified limit
shared_limits <- function(col1, col2, group, limit){
dataset <- dataset
dataset$group <- dataset[[group]]
dataset$newcol <- dataset[[col1]] + dataset[[col2]]
dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum=cumsum(newcol))
dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum>limit, limit, dataset$cumulative_sum)
dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag=lag(limited_cumulative_sum))
dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag),0,dataset$limited_cumulative_sum_lag)
dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
dataset[[col1]] <- ifelse(dataset$adjusted_sum==dataset$newcol, dataset[[col1]],
pmin(dataset[[col1]], dataset$adjusted_sum))
dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
dataset <- dataset %>% ungroup() %>% dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum, -limited_cumulative_sum_lag, -adjusted_sum)
dataset
}
#apply function directly
new_dataset <- shared_limits("a", "b", "groupvar", 25)
#apply function using a separate parameters table and pmap_dfr
shared_limits_table <- tibble(col1 = c("a","b"),
col2 = c("c","d"),
group = "groupvar",
limit = c(25, 30))
dataset <- pmap_dfr(shared_limits_table, shared_limits)
In the example above the pmap function applies the shared limit to columns "a" and "c" and returns an adjusted dataset as the first element in the list. It then applies the shared limit to columns "b" and "d" and returns this as the second element in the list. However the adjustments that have been made to "a" and "c" are now lost.
Is there any way of storing the adjustments that are made to each column as we progress through each iteration of pmap?
You can iteratively apply a function to your dataset with reduce.
First, I'd change your function so that it takes the dataset as an argument (df) instead of reading dataset from the global environment:
shared_limits <- function(df, col1, col2, group, limit){
dataset <- df
dataset$group <- dataset[[group]]
dataset$newcol <- dataset[[col1]] + dataset[[col2]]
dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum=cumsum(newcol))
dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum>limit, limit, dataset$cumulative_sum)
dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag=lag(limited_cumulative_sum))
dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag),0,dataset$limited_cumulative_sum_lag)
dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
dataset[[col1]] <- ifelse(dataset$adjusted_sum==dataset$newcol, dataset[[col1]],
pmin(dataset[[col1]], dataset$adjusted_sum))
dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
dataset <- dataset %>% ungroup() %>% dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum, -limited_cumulative_sum_lag, -adjusted_sum)
dataset
}
Then make a list of the arguments you want to pass to the function at each step
shared_limits_args_list <- list(
list("a", "c", "groupvar", 25),
list("b", "d", "groupvar", 30))
Then call reduce, setting the dataset as your initial x with the .init parameter. At each iteration a sublist of arguments from shared_limits_args_list will be passed to the function as y. [[ is used to select the list elements for each position. The output dataframe from the function will become the new x for the next iteration, and the next sublist of shared_limits_args_list will be the next set of arguments. When all of the sublists of shared_limits_args_list have been used, the final dataframe is output.
dataset_combined <-
reduce(shared_limits_args_list,
function(x,y) shared_limits(df=x, y[[1]], y[[2]], y[[3]], y[[4]]),
.init=dataset)
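To see the carry-over in isolation, here is a stripped-down sketch of the same reduce pattern. scale_col and steps are hypothetical names made up for this illustration. Both steps deliberately touch the same column, so the second step can only give the right answer if it receives the output of the first.

```r
library(purrr)

# Hypothetical helper: multiply one column of a data frame by a factor
scale_col <- function(df, col, factor) {
  df[[col]] <- df[[col]] * factor
  df
}

start <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# One sublist of arguments per iteration
steps <- list(
  list("a", 2),
  list("a", 10))

# acc is the data frame returned by the previous step (start on the first call)
result <- reduce(steps,
                 function(acc, args) scale_col(acc, args[[1]], args[[2]]),
                 .init = start)
```

Column a ends up multiplied by 20, which shows that each iteration works on the previous iteration's output rather than on the original data.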

merging time series datasets, but missing the time series column

I've tried methods from others for merging time series datasets. However, the time series column is missing. Please see the screenshot.
Here is the example of my datasets.
df1 = data.frame(Time = round(seq(1, 200, length.out= 50)), Var1 = runif(50,1, 10))
df2 = data.frame(Time = round(seq(1, 200, length.out= 80)), Var2 = runif(80,1, 10))
df3 = data.frame(Time = round(seq(1, 200, length.out= 100)), Var3 = runif(100,1, 10))
Here is what I've tried.
library(zoo)
a = read.zoo(df1,drop = FALSE)
b = read.zoo(df2,drop = FALSE)
c = read.zoo(df3,drop = FALSE)
abc = merge(a, b, c)
How can I add a first column listing the Time? Any other comments on this task would be welcome.
Thanks.
This converts all three data frames to zoo and merges them into a combined zoo object.
z <- do.call("merge", lapply(list(df1, df2, df3), read.zoo, drop = FALSE))
Note that in zoo objects the time is stored in the index attribute. It is not a column. The statement shown above already includes the time as derived from the first columns of each of the data frames.
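If a plain data frame with an explicit Time column is needed, the index can be pulled out with time() and the values with coredata(). This is a small sketch on made-up data (d1 and d2 are hypothetical stand-ins for the data frames in the question):

```r
library(zoo)

d1 <- data.frame(Time = c(1, 3, 5), Var1 = c(10, 30, 50))
d2 <- data.frame(Time = c(1, 2, 3), Var2 = c(0.1, 0.2, 0.3))

# Merge on the index; times missing from one series are filled with NA
z <- merge(read.zoo(d1, drop = FALSE), read.zoo(d2, drop = FALSE))

# time(z) is the index, coredata(z) is the matrix of values
out <- data.frame(Time = time(z), coredata(z))
```

zoo also provides fortify.zoo(z), which returns a data frame with the index in a column named Index.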

Conditionally applying factor values from one dataframe to another

I have the following two data frames:
letters <- LETTERS[seq(from = 1, to = 5)]
values <- rnorm(5, mean = 50)
df1 <- data.frame(letters, values)
category <- sample(LETTERS[1:5], 20, replace = TRUE)
numbers <- rnorm(20, mean = 100)
df2 <- data.frame(category, numbers)
I want to create a new column in df2 that takes the value in df2$numbers and subtracts the value in df1$values based on the matching letter.
In other words, if the value for "C" in df1 is 49.2, I want to subtract 49.2 from every row in df2$numbers where df2$category equals "C". Hope that makes sense. Thanks for the help!
With dplyr:
library(dplyr)
df <- full_join(df1, df2, by = c('letters' = 'category')) %>%
mutate(diff = numbers - values)
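For comparison, a base R sketch that adds the column to df2 directly, using match() to look up each category's value. The data below are made up for illustration and assume every category in df2 also appears in df1. Unmatched categories would produce NA.

```r
df1 <- data.frame(letters = c("A", "B", "C"), values = c(50, 51, 49.2))
df2 <- data.frame(category = c("C", "A", "C", "B"),
                  numbers = c(100, 101, 99, 100))

# match() gives, for each row of df2, the row of df1 with the same letter
idx <- match(df2$category, df1$letters)
df2$diff <- df2$numbers - df1$values[idx]
```

This keeps df2's row order and row count unchanged, which a full join does not guarantee when df1 contains letters that never occur in df2.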

Aggregate an entire data frame with Weighted Mean

I'm trying to aggregate a data frame using the function weighted.mean and continue to get an error. My data looks like this:
dat <- data.frame(date, nWords, v1, v2, v3, v4 ...)
I tried something like:
aggregate(dat, by = list(dat$date), weighted.mean, w = dat$nWords)
but got
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
There is another thread which answers this question using plyr, but only for one variable; I want to aggregate all my variables that way.
You can do it with data.table:
library(data.table)
#set up your data
dat <- data.frame(date = c("2012-01-01","2012-01-01","2012-01-01","2013-01-01",
"2013-01-01","2013-01-01","2014-01-01","2014-01-01","2014-01-01"),
nwords = 1:9, v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
#make it into a data.table
dat = data.table(dat, key = "date")
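# grab the column names we want, generalized for v1:vwhatever
cols = colnames(dat)[-c(1,2)]
#get the weighted mean by date for each column
#(wrapping n in parentheses assigns to the column named by n; with = FALSE is no longer accepted together with :=)
for(n in cols){
dat[,
(n) := weighted.mean(get(n), nwords),
by = date]
}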
#keep only the unique dates and weighted means
wms = unique(dat[,nwords:=NULL])
Try using by:
# your numeric data
x <- 111:120
# the weights
ww <- 10:1
mat <- cbind(x, ww)
# the group variable (in your case is 'date')
y <- c(rep("A", 7), rep("B", 3))
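# pass the weights explicitly; each group arrives as a data frame with columns x and ww
by(data.frame(mat), y, function(d) weighted.mean(d$x, d$ww))
If you want the results in a data frame, I suggest the plyr package:
plyr::ddply(data.frame(mat, y), "y", plyr::summarise, wm = weighted.mean(x, ww))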
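The aggregate call in the question fails because aggregate hands weighted.mean one group's values at a time, while w = dat$nWords still has the length of the whole data frame. Here is a base R sketch that keeps values and weights together per group and covers every value column at once. The data and the name value_cols are made up for illustration.

```r
dat <- data.frame(
  date = c("2012", "2012", "2013", "2013"),
  nwords = c(1, 3, 2, 2),
  v1 = c(10, 20, 30, 50),
  v2 = c(0, 4, 1, 3))

# Every column except the grouping variable and the weights
value_cols <- setdiff(names(dat), c("date", "nwords"))

# Split by date so each group's values and weights have the same length
wm <- do.call(rbind, lapply(split(dat, dat$date), function(g) {
  data.frame(date = g$date[1],
             lapply(g[value_cols], weighted.mean, w = g$nwords))
}))
```

For 2012, v1 is (10 * 1 + 20 * 3) / 4 = 17.5, which matches the first row of wm.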
