I have several series, each one indicates the deflator for the GDP for each country. (Data attached down below)
So what I want to do is to divide every column for the 97th position.
I know this could be pretty simple for you, but I am struggling.
This is my code so far:
d_data <- d_data %>%
mutate_if(is.numeric, function(x) x/d_data[[97,x]])
So as you can see in the data, from columns 3 to 8 data are numeric.
I think the error is that argument x of the function refers to the column name, while in the d_data, the second argument refers to column position and that is the main issue.
How can I solve this? Thanks in advance!!
Data
Data was massive to put here (745 rows, 8 columns)
So I uploaded the dput(d_data) output here
Use mutate with across as _at/_all are deprecated. Also, to extract by position, use nth
library(dplyr)
d_data %>%
mutate(across(where(is.numeric), ~ .x/nth(.x, 97)))
In the OP's code, instead of d_data[[97,x]], it should be x[97] as x here is the column value itself
d_data %>%
mutate_if(is.numeric, function(x) x/x[97])
If we want to subset the original data column, have to pass either column index or column name. Here, x doesn't refer to column index or name. But with across, we can get the column name with cur_column() e.g. (mtcars %>% summarise(across(everything(), ~ cur_column()))) which is not needed for this case
i would like to index by column name within the sum command using the sequence operator.
library(dbplyr)
library(tidyverse)
df=data.frame(
X=c("A","B","C"),
X.1=c(1,2,3),X.2=c(1,2,3),X.3=c(1,2,3),X.4=c(1,2,3),X.5=c(1,2,3),X.6=c(1,2,3),X.7=c(1,2,3),X.8=c(1,2,3),X.9=c(1,2,3),X.10=c(1,2,3),
X.11=c(1,2,3),X.12=c(1,2,3),X.13=c(1,2,3),X.14=c(1,2,3),X.15=c(1,2,3),X.16=c(1,2,3),X.17=c(1,2,3),X.18=c(1,2,3),X.19=c(1,2,3),X.20=c(1,2,3),
X.21=c(1,2,3),X.22=c(1,2,3),X.23=c(1,2,3),X.24=c(1,2,3),X.25=c(1,2,3),X.26=c(1,2,3),X.27=c(1,2,3),X.28=c(1,2,3),X.29=c(1,2,3),X.30=c(1,2,3),
X.31=c(1,2,3),X.32=c(1,2,3),X.33=c(1,2,3),X.34=c(1,2,3),X.35=c(1,2,3),X.36=c(1,2,3),X.37=c(1,2,3),X.38=c(1,2,3),X.39=c(1,2,3),X.40=c(1,2,3),
X.41=c(1,2,3),X.42=c(1,2,3),X.43=c(1,2,3),X.44=c(1,2,3),X.45=c(1,2,3),X.46=c(1,2,3),X.47=c(1,2,3),X.48=c(1,2,3),X.49=c(1,2,3),X.50=c(1,2,3),
X.51=c(1,2,3),X.52=c(1,2,3),X.53=c(1,2,3),X.54=c(1,2,3),X.55=c(1,2,3),X.56=c(1,2,3))
Is there a quicker way todo this. The following provides the correct result. However, for large datasets (larger than this one ) it becomes vary laborious to deal with especially when pivot_wider is used and the columns are not created before hand (like above)
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1,X.2,X.3,X.4,X.5)),
X=="B"~ sum(c(X.4,X.5)),
X=="C" ~ sum(c( X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12, X.13, X.14, X.15, X.16,
X.17, X.18, X.19, X.20, X.21, X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30,
X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39, X.40, X.41, X.42,X.43, X.44,
X.45, X.46, X.47, X.48, X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56)))) %>% dplyr::select(Result_column)
The following is the how it would be used when using "select" syntax, which is that i would like to use. However, does not provide correct numerical solution. One can shorter the code by ~50 entries, by using a sequence operator ":".
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
below is a related question, however, not the same because what is needed is not a column that starts with "X" but rather a sequence.
Using mutate rowwise over a subset of columns
EDIT:
the provided code (below) from cnbrowlie is correct.
df %>% mutate(
Result_column=case_when(
X=="A"~ sum(c(X.1:X.5)),
X=="B"~ sum(c(X.4:X.5)),
X=="C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
This can be done with dplyr>=1.0.0 using rowSums() (which computes the sum for a row across multiple columns) and across() (which superceded vars() as a method for specifying columns in a dataframe, allowing the use of : to select sequences of columns):
df %>% rowwise() %>% mutate(
Result_column=case_when(
X=="A"~ rowSums(across(X.1:X.5)),
X=="B"~ rowSums(across(X.4:X.5)),
X=="C" ~ rowSums(across(X.3:X.56))
)
) %>% dplyr::select(Result_column)
In a previous question I wanted to carry out case_when with a dynamic number of cases. The solution was to use parse_exprs along with !!!. I am looking for a similar solution to mutate/summarise with a dynamic number of columns.
Consider the following dataset.
library(dplyr)
library(rlang)
data(mtcars)
mtcars = mtcars %>%
mutate(g2 = ifelse(gear == 2, 1, 0),
g3 = ifelse(gear == 3, 1, 0),
g4 = ifelse(gear == 4, 1, 0))
Suppose I want to sum the columns g2, g3, g4. If I know these are the columns names then this is simple, standard dplyr:
answer = mtcars %>%
summarise(sum_g2 = sum(g2),
sum_g3 = sum(g3),
sum_g4 = sum(g4))
But suppose I do not know how many columns there are, or their exact names. Instead, I have a vector containing all the column names I care about. Following the logic in the accepted answer of my previous approach I would use:
columns_to_sum = c("g2","g3","g4")
formulas = paste0("sum_",columns_to_sum," = sum(",columns_to_sum,")")
answer = mtcars %>%
summarise(!!!parse_exprs(formulas))
If this did work, then regardless of the column names provided as input in columns_to_sum, I should receive the sum of the corresponding columns. However, this is not working. Instead of a column named sum_g2 containing sum(g2) I get a column called "sum_g2 = sum(g2)" and every value in this column is a zero.
Given that I can pass formulas into case_when it seems like I should be able to pass formulas into summarise (and the same idea should also work for mutate because they all use the rlang package).
In the past there were string versions of mutate and summarise (mutate_ and summarise_) that you could pass formulas to as strings. But these have been retired as the rlang approach is the intended approach now. The related questions I reviewed on Stackoverflow did not use the rlang quotation approach and hence are not sufficient for my purposes.
How do I summarise with a dynamic number of columns (using an rlang approach)?
One option since dplyr 1.0.0 could be:
mtcars %>%
summarise(across(all_of(columns_to_sum), sum, .names = "sum_{col}"))
sum_g2 sum_g3 sum_g4
1 0 15 12
Your attempt gives the correct answer but do not give column names as expected.
Here's an approach using map to get the names correct :
library(dplyr)
library(rlang)
library(purrr)
map_dfc(columns_to_sum, ~mtcars %>%
summarise(!!paste0('sum_', .x) := sum(!!sym(.x))))
# sum_g2 sum_g3 sum_g4
#1 0 15 12
You can also use this simple base R approach without any NSE-stuff :
setNames(data.frame(t(colSums(mtcars[columns_to_sum]))),
paste0('sum_', columns_to_sum))
and same in dplyr way :
mtcars %>%
summarise(across(all_of(columns_to_sum), sum)) %>%
set_names(paste0('sum_', columns_to_sum))
This is a mock-up based on mtcars of what I would like to do:
compute a column that counts the number of cars that have less
displacement (disp) of the current row within the same gear type
category (am)
expected column is the values I would like to get
try1 is one try with the findInterval function, the problem is that I cannot make it count across the subsets that depend on the category (am)
I have tried solutions with *apply but I am somehow never able to make the function called work only on a subset that depends on the value of a variable of the row that is processed (hope this makes sense).
x = mtcars[1:6,c("disp","am")]
# expected values are the number of cars that have less disp while having the same am
x$expected = c(1,1,0,1,2,0)
#this ordered table is for findInterval
a = x[order(x$disp),]
a
# I use the findInterval function to get the number of values and I try subsetting the call
# -0.1 is to deal with the closed intervalq
x$try1 = findInterval(x$disp-0.1, a$disp[a$am==x$am])
x
# try1 values are not computed depending on the subsetting of a
Any solution will do; the use of the findInterval function is not mandatory.
I'd rather have a more general solution enabling a column value to be computed by calling a function that takes values from the current row to compute the expected value.
As pointed out by #dimitris_ps, the previous solution neglects the duplicated counts. Following provides the remedy.
library(dplyr)
x %>%
group_by(am) %>%
mutate(expected=findInterval(disp, sort(disp) + 0.0001))
or
library(data.table)
setDT(x)[, expected:=findInterval(disp, sort(disp) + 0.0001), by=am]
Based on #Khashaa's logic this is my approach
library(dplyr)
mtcars %>%
group_by(am) %>%
mutate(expected=match(disp, sort(disp))-1)