I have a dataframe containing a line per company, with different variables (some numeric, others not):
data <- data.frame(id=1:5,
CA = c(1200,1500,1550,200,0),
EBE = c(800,50,654,8555,0),
VA = c(6984,6588,633,355,84),
FBCF = c(35,358,358,1331,86),
name=c("qsdf","xdwfq","qsdf","sqdf","qsdfaz"),
weight = c(1, 5, 10,1 ,1))
I would like to summarise all numeric variables by a weighted sum. If I wanted a simple sum I would do:
data %>% summarise_if(is.numeric,sum)
but I don't see how to define a weighted sum.
I tried:
w.sum <- function(x) {sum(x*weight) %>% return()}
but without any success.
We can write the weighted sum directly inside funs(), where the 'weight' column is visible:
data %>%
summarise_if(is.numeric, funs(sum(.*weight)))
Note that the above applies the function to every column of numeric class. In the example, the 'id' column is numeric too (as is 'weight' itself), and it presumably does not need the summarisation. A better option would be summarise_at to specify the columns of interest
data %>%
summarise_at(names(.)[2:5], funs(sum(.*weight)))
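In current dplyr releases funs() is deprecated, and across() supersedes the summarise_at()/summarise_if() family. The following is a minimal sketch of the same weighted sum with across(), assuming dplyr 1.0.0 or later, the data frame from the question, and that CA through FBCF are the columns of interest:

```r
library(dplyr)

# Data frame from the question
data <- data.frame(id = 1:5,
                   CA = c(1200, 1500, 1550, 200, 0),
                   EBE = c(800, 50, 654, 8555, 0),
                   VA = c(6984, 6588, 633, 355, 84),
                   FBCF = c(35, 358, 358, 1331, 86),
                   name = c("qsdf", "xdwfq", "qsdf", "sqdf", "qsdfaz"),
                   weight = c(1, 5, 10, 1, 1))

# Multiply each selected column by `weight` row by row, then sum
weighted <- data %>%
  summarise(across(CA:FBCF, ~ sum(.x * weight)))

print(weighted)
```

Selecting the columns by name (CA:FBCF) rather than by position keeps 'id' and 'weight' out of the summary even if the column order changes.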
I was trying to figure out a way to transform all values in selected columns of my dataset using an equation $$x_i = x_{max} - x_i$$ using dplyr. I'm not sure how to correctly do this for one column, let alone multiple columns. My attempt at mutating 1 column:
df1 <- df %>% mutate(column1 = replace(column1, ., x = max(column1) - x)
My x = max(column1) - x part is not literal, I just want to know how I can implement that equation into all row entries in the column. Furthermore, how can I do this for multiple columns in the same line? Any help is appreciated. Thanks!
If the goal is to replace all values across multiple columns, loop across the numeric columns and subtract each value from the maximum of its column
library(dplyr)
df <- df %>%
mutate(across(where(is.numeric), ~ max(., na.rm = TRUE) - .))
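To see what this does, here is a small sketch on made-up data (the data frame and its column names are hypothetical, not from the question). Each numeric column is replaced by its maximum minus the value, the character column is left alone, and missing values stay missing:

```r
library(dplyr)

# Hypothetical toy data: two numeric columns and one character column
df <- data.frame(a = c(1, 5, 3),
                 b = c(10, NA, 4),
                 label = c("x", "y", "z"))

# x_i becomes max(x) - x_i for every numeric column
flipped <- df %>%
  mutate(across(where(is.numeric), ~ max(.x, na.rm = TRUE) - .x))

print(flipped)
```

To restrict the transformation to a few named columns, where(is.numeric) could be swapped for something like c(a, b).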
I have got data with observations in rows. There are an outcome variable y (dbl) as well as multiple factors, herein called f_1 and f_2. The latter denote conditions of an experiment. The data situation is mirrored by the following minimal example:
set.seed(123)
y = rnorm(10)
f_1 = factor(rep(c("A", "B"), 5))
f_2 = factor(rep(c("C", "D"), each = 5))
dat <- data.frame(y, f_1, f_2)
I would like to compute mean values of y for groups defined by f_1 and f_2. Importantly, I do not want a mean value for each combination of f_1 and f_2, but mean values based on f_1 on the one hand and mean values based on f_2 on the other hand. These should be saved as factors in dat, where each observation has a mean_f_1 (mean value when the data is grouped according to f_1) and a mean_f_2 (mean value when the data is grouped according to f_2). The labels of the new factors mean_f_1 and mean_f_2 should correspond to the values = labels of f_1 and f_2. The labels have a meaning. Thus, a mean calculated for group "A" (from f_1) should keep the label "A" (in mean_f_1). The number of condition variables f_... in the original data is higher than 2. Thus, I would like to not repeat code for each factor (see I).
I have come up with two approaches. The first (I; group_by approach) gives the desired result but repeats code for each factor.
I) group_by approach
library(dplyr)
dat %>%
group_by(f_1) %>%
mutate(mean_f_1 = factor(mean(y), label = unique(f_1))) %>%
group_by(f_2) %>%
mutate(mean_f_2 = factor(mean(y), label = unique(f_2)))
Repeating the 'group_by - mutate' statements for each factor seems avoidable, but I did not manage to use across() here.
The other approach (II; ave approach) avoids code repetition, but won't assign factor labels. Assigning factor labels using unique() messed up the order of labels in the original data.
II) ave approach
dat %>% mutate(across(starts_with("f"),
~ ave(y, .x, FUN = mean),
.names = "mean_{.col}"))
Do you have an idea how to ...
... improve (I) to work on multiple factors?
... improve (II) to include factor labels?
... solve the problem differently?
A dplyr solution is preferred.
To avoid repeating code for each factor, I suggest iterating over factors. Something like:
library(dplyr)
factors = c("f_1", "f_2")
for(ff in factors){
new_col = paste0("mean_",ff)
dat <- dat %>%
group_by(!!sym(ff)) %>%
mutate(!!sym(new_col) := factor(mean(y), label = unique(!!sym(ff))))
}
This produces identical output to your group_by approach. To scale up to more columns, add their names to the factors vector and the code will iterate over them.
The !!sym(.) is used to turn a character string into a column name. There are several other ways to do this, see the programming with dplyr vignette for other options. The unusual assignment operator := has the same behavior as = except it can accept some prep on the left-hand-side.
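As an illustration of one of those other options, the loop can also be written with all_of() and the .data pronoun instead of !!sym(). This is only a sketch, assuming dplyr 1.0.0 or later and the example data from the question. It builds the factor the same way as the loop above:

```r
library(dplyr)

# Example data from the question
set.seed(123)
dat <- data.frame(y = rnorm(10),
                  f_1 = factor(rep(c("A", "B"), 5)),
                  f_2 = factor(rep(c("C", "D"), each = 5)))

factors <- c("f_1", "f_2")

for (ff in factors) {
  new_col <- paste0("mean_", ff)
  dat <- dat %>%
    # group by the column whose name is stored in ff
    group_by(across(all_of(ff))) %>%
    # .data[[ff]] looks the column up by its name within each group
    mutate(!!new_col := factor(mean(y), labels = unique(.data[[ff]]))) %>%
    ungroup()
}

print(dat)
```

The ungroup() at the end of each iteration is a design choice so that the returned data frame carries no leftover grouping.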
I have a bunch of factors that are really ordinals but they're coded as numerics.
This is my code
student_performance <-
read_csv("https://raw.githubusercontent.com/UBC-MDS/ellognea-smwatts-student-performance/master/data/student-math-perf.csv") %>%
as_tibble()
convert.to.ordinals <-
c("Medu",
"Fedu",
"traveltime",
"studytime")
student_performance %>%
mutate_at(vars(convert.to.ordinals), as.factor(ordered = T))
I'm trying to organize them as ordinals and get them to be in ascending order, so it would be the same as doing factor(student_performance$Medu, levels = c(1, 2, 3, 4)) except for all of the ones in the list of variable names
In newer versions of dplyr, we can use across() to loop over the columns named in the vector convert.to.ordinals, apply factor() to each of them, and assign the output back to the original object so that the change is kept
library(dplyr)
student_performance <- student_performance %>%
mutate(across(all_of(convert.to.ordinals), ~
factor(., ordered = TRUE)))
NOTE: across() is a generic way to loop over groups of columns, and it replaces mutate_at, mutate_all and mutate_if. Its .cols argument determines which columns are used: a vector of column names wrapped in all_of(), select helpers such as matches(), starts_with() or ends_with(), everything() (the mutate_all case), or where() (the mutate_if case)
Or with mutate_at, where the key is the lambda function (~ is shorthand for function(x))
student_performance %>%
mutate_at(vars(convert.to.ordinals), ~ factor(., ordered = TRUE))
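If the level order should be fixed explicitly, as in the factor(..., levels = c(1, 2, 3, 4)) call from the question, the levels can be passed inside the lambda. Below is a sketch on a small made-up data frame (a hypothetical stand-in so that nothing has to be downloaded), assuming the codes run from 1 to 4 as stated in the question:

```r
library(dplyr)

# Hypothetical stand-in for the student data
students <- data.frame(Medu = c(2, 4, 1, 3),
                       Fedu = c(1, 1, 3, 2),
                       age = c(15, 16, 17, 15))

convert.to.ordinals <- c("Medu", "Fedu")

# Convert only the listed columns to ordered factors with levels 1 < 2 < 3 < 4
students <- students %>%
  mutate(across(all_of(convert.to.ordinals),
                ~ factor(.x, levels = 1:4, ordered = TRUE)))

str(students)
```

Any value outside the given levels becomes NA, so the levels should cover every code that occurs in the real data.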
I have a list of statcast data, per day dating back to 2016. I am attempting to aggregate this data for finding the mean for each pitching ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This function aggregates every single column. I am looking to aggregate only certain columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add summarise_at function like so:
pitchingstat %>%
group_by(PitcherID) %>%
summarise_at(vars(col1:coln), mean, na.rm = TRUE)
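With dplyr 1.0.0 or later the same summary can be written with across(). The sketch below uses a made-up stand-in for the Statcast data. The columns Velocity, SpinRate and Team are hypothetical and only illustrate the pattern:

```r
library(dplyr)

# Hypothetical stand-in for the Statcast data
pitchingstat <- data.frame(PitcherID = c(1, 1, 2, 2),
                           Velocity = c(90, 94, 88, NA),
                           SpinRate = c(2200, 2400, 2000, 2100),
                           Team = c("A", "A", "B", "B"))

# Mean of the selected columns only, one row per pitcher
aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(across(c(Velocity, SpinRate), ~ mean(.x, na.rm = TRUE)))

print(aggpitch)
```

Columns that are not listed in across(), such as Team here, are simply dropped from the result.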
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with only the column or columns you want to aggregate, for example pitchingstat[c("col1", "col2")].
How about:
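A sketch of that base R approach, again on made-up data (the column names Velocity, SpinRate and Team are hypothetical): subset the data frame to the columns of interest before handing it to aggregate().

```r
# Hypothetical stand-in for the Statcast data
pitchingstat <- data.frame(PitcherID = c(1, 1, 2, 2),
                           Velocity = c(90, 94, 88, NA),
                           SpinRate = c(2200, 2400, 2000, 2100),
                           Team = c("A", "A", "B", "B"))

cols <- c("Velocity", "SpinRate")

# Only the columns in `cols` are aggregated; na.rm is passed on to mean()
aggpitch <- aggregate(pitchingstat[cols],
                      by = list(PitcherID = pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)

print(aggpitch)
```

Naming the element of the by list gives the grouping column a readable name in the output instead of the default Group.1.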
library(tidyverse)
aggpitch <- pitchingstat %>%
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(variable)) #replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
select(PitcherID, var_1, var_2) %>% # keep the grouping column as well
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(var_1),
pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
I wonder whether there is a solution for transforming multiple columns within a pipe.
Let's say we have a tibble with three columns. iq_pre and iq_post have to be transformed to a log scale and saved into new columns.
library(tidyverse)
library(magrittr)
df <- tibble(
iq_pre = rnorm(10, 100, 15),
iq_post = rnorm(10, 100, 18),
gender = rep(c("m", "f"), each = 5)
)
I know I could get the result with base R by doing
df[c("iq_pre_lg", "iq_post_lg")] <- log(df[c("iq_pre", "iq_post")])
or looping over the columns with lapply.
The only tidy solution I came up with is to use mutate manually for each column like this
df %<>%
mutate(iq_pre_lg = log(iq_pre),
iq_post_lg = log(iq_post))
Since the names of the columns which should be transformed start with the same letters, I could also use
df %<>%
mutate_at(vars(starts_with("iq")), funs(lg = log(.)))
But what if I want to convert like 20 columns with different names? Is there a way to use purrr::map or maybe even tidyr::nest to solve this in a more elegant way?
We can use
df %>%
mutate_at(vars(matches("iq")), log)
One advantage of matches is that it can take multiple patterns in a single call. For example, if we need to apply the function to columns that start (^) with 'iq' or (|) end ($) with 'oq', this can be passed as a single regular expression to matches
df %>%
mutate_at(vars(matches('^iq|oq$')), log)
If the column names are completely different, so that n columns would need n patterns, but the columns still sit in known positions, the position numbers can be passed to mutate_at instead, either directly or inside vars. In the current example, the 'iq' columns are the 1st and 2nd columns
df %>%
mutate_at(1:2, log)
Similarly, if the 20 columns occupy the 1st 20 positions
df %>%
mutate_at(1:20, log)
Or if the positions are 1 to 6, 8 to 12, 41:50
df %>%
mutate_at(vars(1:6, 8:12, 41:50), log)
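The calls above overwrite the original columns. To keep them and store the logs in new columns, as the question asks, one option is across() with its .names argument. This is a sketch assuming dplyr 1.0.0 or later. The vector cols_to_log is a hypothetical name and could hold 20 unrelated column names just as well:

```r
library(dplyr)

set.seed(1)
df <- tibble(
  iq_pre = rnorm(10, 100, 15),
  iq_post = rnorm(10, 100, 18),
  gender = rep(c("m", "f"), each = 5)
)

# Names of the columns to transform; they need not share a pattern
cols_to_log <- c("iq_pre", "iq_post")

# Keep the originals and append <name>_lg columns holding the logs
df_lg <- df %>%
  mutate(across(all_of(cols_to_log), log, .names = "{.col}_lg"))

print(df_lg)
```

all_of() raises an error if one of the listed names is missing from the data, which helps catch typos in a long list of column names.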