mutate_at using function lag but keep first row - r

I am using dplyr in R (with great joy) and want to get the row-to-row differences of the columns mpg to gear in mtcars. The first row then returns NA (for obvious reasons). Instead of this first row being NA, I would like it to keep its original value.
I am looking for a clean and efficient way to achieve this (without using a join to add the first row back to the differenced values, since the code on my own dataset contains many filters and grouped variables).
My code is as follows:
mtcars %>% mutate_at(vars(mpg:gear), funs(. - lag(., 1)))
I expect the first row to be the original first row (mtcars[1, ]) and the rest to be the differences.

We can set the default argument of lag to 0; otherwise the first value would be NA
library(dplyr)
mtcars %>%
mutate_at(vars(mpg:gear), list(~ . - lag(., default = 0)))
Another option is diff, with the first element prepended
mtcars %>%
mutate_at(vars(mpg:gear), list(~ c(first(.), diff(.))))
NOTE: funs is deprecated, so a list of lambda formulas is used here instead
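For readers on dplyr 1.0.0 or later, the same idea can be sketched with across() instead of mutate_at(). This is only an illustration of the default = 0 approach above, not a different method; the name result is introduced here for the example.

```r
library(dplyr)

# Subtract the previous row from each row in columns mpg..gear.
# lag() fills the missing "previous" value of row 1 with 0,
# so the first row keeps its original value (x - 0 = x).
result <- mtcars %>%
  mutate(across(mpg:gear, ~ .x - dplyr::lag(.x, default = 0)))

head(result, 3)
```

Calling dplyr::lag explicitly avoids accidentally picking up stats::lag, which behaves differently.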

Related

Divide by a certain position in R

I have several series, each one indicates the deflator for the GDP for each country. (Data attached down below)
So what I want to do is to divide every column for the 97th position.
I know this could be pretty simple for you, but I am struggling.
This is my code so far:
d_data <- d_data %>%
mutate_if(is.numeric, function(x) x/d_data[[97,x]])
So as you can see in the data, from columns 3 to 8 data are numeric.
I think the error is that argument x of the function refers to the column name, while in the d_data, the second argument refers to column position and that is the main issue.
How can I solve this? Thanks in advance!!
Data
Data was massive to put here (745 rows, 8 columns)
So I uploaded the dput(d_data) output here
Use mutate with across, as the scoped variants (_if/_at/_all) have been superseded. Also, to extract by position, use nth
library(dplyr)
d_data %>%
mutate(across(where(is.numeric), ~ .x/nth(.x, 97)))
In the OP's code, instead of d_data[[97,x]], it should be x[97] as x here is the column value itself
d_data %>%
mutate_if(is.numeric, function(x) x/x[97])
If we wanted to subset the original data column instead, we would have to pass either the column index or the column name, and x here is neither. With across, the column name is available through cur_column(), e.g. (mtcars %>% summarise(across(everything(), ~ cur_column()))), but that is not needed in this case
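Since the original data is not included, here is a minimal sketch of the same rebasing on a made-up data frame. The names toy, defl_1, defl_2 and base_row are hypothetical; base_row plays the role of row 97 in the question.

```r
library(dplyr)

# Hypothetical stand-in for d_data: one label column, two numeric series.
toy <- data.frame(
  country = c("a", "b", "c", "d"),
  defl_1  = c(50, 100, 150, 200),
  defl_2  = c(2, 4, 8, 16)
)

base_row <- 2  # position to divide by (97 in the question)

# Divide every numeric column by its own value at base_row.
rebased <- toy %>%
  mutate(across(where(is.numeric), ~ .x / nth(.x, base_row)))

rebased
```

After this step every numeric column equals 1 at base_row, which is a quick way to check that the right position was used.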

Using the R syntax sequence operator ":" within the sum command with more than 50 columns

I would like to index by column name within the sum command using the sequence operator.
library(dplyr)
library(tidyverse)
df=data.frame(
X=c("A","B","C"),
X.1=c(1,2,3),X.2=c(1,2,3),X.3=c(1,2,3),X.4=c(1,2,3),X.5=c(1,2,3),X.6=c(1,2,3),X.7=c(1,2,3),X.8=c(1,2,3),X.9=c(1,2,3),X.10=c(1,2,3),
X.11=c(1,2,3),X.12=c(1,2,3),X.13=c(1,2,3),X.14=c(1,2,3),X.15=c(1,2,3),X.16=c(1,2,3),X.17=c(1,2,3),X.18=c(1,2,3),X.19=c(1,2,3),X.20=c(1,2,3),
X.21=c(1,2,3),X.22=c(1,2,3),X.23=c(1,2,3),X.24=c(1,2,3),X.25=c(1,2,3),X.26=c(1,2,3),X.27=c(1,2,3),X.28=c(1,2,3),X.29=c(1,2,3),X.30=c(1,2,3),
X.31=c(1,2,3),X.32=c(1,2,3),X.33=c(1,2,3),X.34=c(1,2,3),X.35=c(1,2,3),X.36=c(1,2,3),X.37=c(1,2,3),X.38=c(1,2,3),X.39=c(1,2,3),X.40=c(1,2,3),
X.41=c(1,2,3),X.42=c(1,2,3),X.43=c(1,2,3),X.44=c(1,2,3),X.45=c(1,2,3),X.46=c(1,2,3),X.47=c(1,2,3),X.48=c(1,2,3),X.49=c(1,2,3),X.50=c(1,2,3),
X.51=c(1,2,3),X.52=c(1,2,3),X.53=c(1,2,3),X.54=c(1,2,3),X.55=c(1,2,3),X.56=c(1,2,3))
Is there a quicker way to do this? The following provides the correct result. However, for large datasets (larger than this one) it becomes very laborious to deal with, especially when pivot_wider is used and the columns are not created beforehand (like above).
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1,X.2,X.3,X.4,X.5)),
X=="B"~ sum(c(X.4,X.5)),
X=="C" ~ sum(c( X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12, X.13, X.14, X.15, X.16,
X.17, X.18, X.19, X.20, X.21, X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30,
X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39, X.40, X.41, X.42,X.43, X.44,
X.45, X.46, X.47, X.48, X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56)))) %>% dplyr::select(Result_column)
The following shows how it would look with "select"-style syntax, which is what I would like to use. However, it does not give the correct numerical result. With the sequence operator ":" the code would be about 50 entries shorter.
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
Below is a related question. It is not the same, however, because what is needed is not every column that starts with "X" but rather a sequence of columns.
Using mutate rowwise over a subset of columns
EDIT:
the provided code (below) from cnbrowlie is correct.
df %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
This can be done with dplyr>=1.0.0 using rowSums() (which computes the sum for a row across multiple columns) and across() (which superseded vars() as a method for specifying columns in a dataframe, allowing the use of : to select sequences of columns):
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ rowSums(across(X.1:X.5)),
X=="B"~ rowSums(across(X.4:X.5)),
X=="C" ~ rowSums(across(X.3:X.56))
)
) %>% dplyr::select(Result_column)
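To illustrate why the ":" version in the question gives wrong totals, here is a reduced sketch. Inside sum() under rowwise(), X.1:X.5 is the base R sequence operator applied to the cell values (for row A that is 1:1), not a column range; inside across() the same ":" is interpreted as a column selection. The data frame small is a hypothetical five-column stand-in for df, and rowwise() is omitted on the assumption that rowSums() already works row by row.

```r
library(dplyr)

# Hypothetical, shortened version of df from the question.
small <- data.frame(
  X   = c("A", "B", "C"),
  X.1 = c(1, 2, 3), X.2 = c(1, 2, 3), X.3 = c(1, 2, 3),
  X.4 = c(1, 2, 3), X.5 = c(1, 2, 3)
)

# across() turns X.1:X.5 into a column selection; rowSums() then adds
# the selected columns within each row.
res <- small %>%
  mutate(
    Result_column = case_when(
      X == "A" ~ rowSums(across(X.1:X.5)),
      X == "B" ~ rowSums(across(X.4:X.5)),
      X == "C" ~ rowSums(across(X.3:X.5))
    )
  )

res$Result_column
```

Dropping rowwise() is a design choice for speed on larger data; the rowwise() form in the answer above should give the same totals.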

mutate and/or summarise a dynamic number of columns

In a previous question I wanted to carry out case_when with a dynamic number of cases. The solution was to use parse_exprs along with !!!. I am looking for a similar solution to mutate/summarise with a dynamic number of columns.
Consider the following dataset.
library(dplyr)
library(rlang)
data(mtcars)
mtcars = mtcars %>%
mutate(g2 = ifelse(gear == 2, 1, 0),
g3 = ifelse(gear == 3, 1, 0),
g4 = ifelse(gear == 4, 1, 0))
Suppose I want to sum the columns g2, g3, g4. If I know these are the columns names then this is simple, standard dplyr:
answer = mtcars %>%
summarise(sum_g2 = sum(g2),
sum_g3 = sum(g3),
sum_g4 = sum(g4))
But suppose I do not know how many columns there are, or their exact names. Instead, I have a vector containing all the column names I care about. Following the logic in the accepted answer to my previous question I would use:
columns_to_sum = c("g2","g3","g4")
formulas = paste0("sum_",columns_to_sum," = sum(",columns_to_sum,")")
answer = mtcars %>%
summarise(!!!parse_exprs(formulas))
If this did work, then regardless of the column names provided as input in columns_to_sum, I should receive the sum of the corresponding columns. However, this is not working. Instead of a column named sum_g2 containing sum(g2) I get a column called "sum_g2 = sum(g2)" and every value in this column is a zero.
Given that I can pass formulas into case_when it seems like I should be able to pass formulas into summarise (and the same idea should also work for mutate because they all use the rlang package).
In the past there were string versions of mutate and summarise (mutate_ and summarise_) that you could pass formulas to as strings. But these have been retired as the rlang approach is the intended approach now. The related questions I reviewed on Stackoverflow did not use the rlang quotation approach and hence are not sufficient for my purposes.
How do I summarise with a dynamic number of columns (using an rlang approach)?
One option since dplyr 1.0.0 could be:
mtcars %>%
summarise(across(all_of(columns_to_sum), sum, .names = "sum_{col}"))
sum_g2 sum_g3 sum_g4
1 0 15 12
Your attempt gives the correct answer but does not give the column names you expected.
Here's an approach using map to get the names right:
library(dplyr)
library(rlang)
library(purrr)
map_dfc(columns_to_sum, ~mtcars %>%
summarise(!!paste0('sum_', .x) := sum(!!sym(.x))))
# sum_g2 sum_g3 sum_g4
#1 0 15 12
You can also use this simple base R approach without any NSE:
setNames(data.frame(t(colSums(mtcars[columns_to_sum]))),
paste0('sum_', columns_to_sum))
and the same the dplyr way:
mtcars %>%
summarise(across(all_of(columns_to_sum), sum)) %>%
set_names(paste0('sum_', columns_to_sum))
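For completeness, here is one way the rlang quotation approach from the question could be made to work; it is a sketch, not the only option. The string "sum_g2 = sum(g2)" is parsed as a single unnamed expression, so the name never reaches summarise(). Building the calls with expr() and sym() and naming the list lets !!! splice them in as named arguments. The names cars and sum_exprs are introduced here for the example, and only g3 and g4 are used to keep it short.

```r
library(dplyr)
library(rlang)

cars <- mtcars %>%
  mutate(g3 = ifelse(gear == 3, 1, 0),
         g4 = ifelse(gear == 4, 1, 0))

columns_to_sum <- c("g3", "g4")

# One unevaluated call per column, e.g. sum(g3).
sum_exprs <- lapply(columns_to_sum, function(col) expr(sum(!!sym(col))))

# The list names become the output column names when spliced with !!!.
names(sum_exprs) <- paste0("sum_", columns_to_sum)

answer <- cars %>% summarise(!!!sum_exprs)
answer
```

The same named list can be spliced into mutate() in the same way.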

How to compute a column that depends on a function that uses the value of a variable of each row?

This is a mock-up based on mtcars of what I would like to do:
compute a column that counts the number of cars that have less
displacement (disp) of the current row within the same gear type
category (am)
expected column is the values I would like to get
try1 is one try with the findInterval function, the problem is that I cannot make it count across the subsets that depend on the category (am)
I have tried solutions with *apply but I am somehow never able to make the function called work only on a subset that depends on the value of a variable of the row that is processed (hope this makes sense).
x = mtcars[1:6,c("disp","am")]
# expected values are the number of cars that have less disp while having the same am
x$expected = c(1,1,0,1,2,0)
#this ordered table is for findInterval
a = x[order(x$disp),]
a
# I use the findInterval function to get the number of values and I try subsetting the call
# -0.1 is to deal with the closed interval
x$try1 = findInterval(x$disp-0.1, a$disp[a$am==x$am])
x
# try1 values are not computed depending on the subsetting of a
Any solution will do; the use of the findInterval function is not mandatory.
I'd rather have a more general solution enabling a column value to be computed by calling a function that takes values from the current row to compute the expected value.
As pointed out by @dimitris_ps, the previous solution neglects the duplicated counts. The following fixes that.
library(dplyr)
x %>%
group_by(am) %>%
mutate(expected=findInterval(disp, sort(disp) + 0.0001))
or
library(data.table)
setDT(x)[, expected:=findInterval(disp, sort(disp) + 0.0001), by=am]
Based on @Khashaa's logic, this is my approach
library(dplyr)
mtcars %>%
group_by(am) %>%
mutate(expected=match(disp, sort(disp))-1)
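The question also asks for a more general pattern, where a function is called with the value of the current row. One possible sketch is to loop over the values inside a grouped mutate(); within the anonymous function, disp still refers to the whole column of the current group. The column name n_less and the object counted are introduced here for illustration.

```r
library(dplyr)

x <- mtcars[1:6, c("disp", "am")]

# For each row's disp value d, count the rows in the same am group
# whose disp is strictly smaller than d.
counted <- x %>%
  group_by(am) %>%
  mutate(n_less = vapply(disp, function(d) sum(disp < d), integer(1))) %>%
  ungroup()

counted
```

This is slower than findInterval() or match() on large groups, but the inner function can be replaced by any per-row computation.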

Error dplyr summarise

I have a data.frame:
set.seed(1L)
vector <- data.frame(patient=rep(1:5,each=2),medicine=rep(1:3,length.out=10),prob=runif(10))
I want to get the mean of the "prob" column while grouping by patient. I do this with the following code:
vector %>%
group_by(patient) %>%
summarise(average=mean(prob))
This code works perfectly. However, I need to get the same values without using the word "prob" on the "summarise" line. I tried the following code, but it gives me a data.frame in which the column "average" is a vector with 5 identical values, which is not what I want:
vector %>%
group_by(patient) %>%
summarise(average=mean(vector[,3]))
PS: to explain why I need this, I have another data frame with multiple columns with complex names that need to be "summarised", which is why I can't list them one by one in the summarise command. What I want is to pass a vector there to calculate the mean of each column grouped by patient.
It appears you want summarise_each
vector %>%
group_by(patient) %>%
summarise_each(funs(mean), vars= matches('prob'))
Using data.table you could do
library(data.table)
setDT(vector)[, lapply(.SD, mean), by = patient, .SDcols = 'prob']
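The attempt with mean(vector[,3]) returns five identical values because vector[,3] looks up the whole data frame and ignores the grouping. On dplyr 1.0.0 or later, a possible sketch for passing a character vector of column names is across() with all_of(). The names patients, cols_to_average and averages are hypothetical, introduced only for this example.

```r
library(dplyr)

set.seed(1L)
patients <- data.frame(
  patient  = rep(1:5, each = 2),
  medicine = rep(1:3, length.out = 10),
  prob     = runif(10)
)

# Column names held in a vector; more names can be added here.
cols_to_average <- c("prob")

# across() evaluates mean() per group for every selected column.
averages <- patients %>%
  group_by(patient) %>%
  summarise(across(all_of(cols_to_average), mean))

averages
```

The output keeps the original column names; a .names argument such as "average_{.col}" could be added to across() if different names are wanted.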

Resources