I am wrestling with programming with dplyr in R to operate on columns of a data frame that I know only by their string names. I know dplyr was recently updated to support quosures and the like, and I've reviewed what I think are the relevant parts of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.
My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr within a function or even a script where the column name may change between runs because I can't hard-code the unquoted (i.e., "bare") column name generally. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.
For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.
Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs to have a column that is named according to the chosen cutoff quantile (this currently works in the below code), and I want to later operate on this column name. Because of the generality of this column name, I can only know it in terms of the input pctCutoff at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName, which follows a predefined pattern based on the value of pctCutoff.
userInput_prob1 <- 0.95
userInput_prob2 <- 0.9
# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function( pctCutoff ){
  # Define new column name to hold the MPG percentile cutoff.
  probColName <- paste0('P', pctCutoff*100)
  # Compute the MPG percentile cutoff by number of gears.
  MPGCutoffs <- mtcars %>%
    dplyr::group_by( gear ) %>%
    dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
  # Filter mtcars with only MPG values above cutoffs.
  output <- mtcars %>%
    dplyr::left_join( MPGCutoffs, by='gear' ) %>%
    dplyr::filter( mpg > !!probColName ) #****This doesn't run; this is where I'm stuck
  # Return filtered data.
  return(output)
}
best_1 <- getBestMPG( userInput_prob1 )
best_2 <- getBestMPG( userInput_prob2 )
The dplyr::filter() statement is what I can't get to run properly. I've tried:
dplyr::filter( mpg > probColName ) - No error, but no rows returned.
dplyr::filter( mpg > !!probColName ) - No error, but no rows returned.
I've also seen examples where I could pass something like quo(P95) to the function and then unquote it in the call to dplyr::filter(); I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter() fails because the column created is named P90 and not P95.
Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.
If you have a column name in a string (aka character vector) and you want to use it with tidyeval, then you can convert it to a symbol with rlang::sym(). Just change the filter line to
dplyr::filter( mpg > !!rlang::sym(probColName) )
and it should work. This is taken from the recommendation at this github issue: https://github.com/tidyverse/rlang/issues/116
It's still fine to use
dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
because when dynamically setting a parameter name, you just need the string and not an unquoted symbol.
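Putting the pieces together, here's the question's function with only the filter line changed (a sketch of the fix applied end to end):
library(dplyr)

getBestMPG <- function( pctCutoff ){
  probColName <- paste0('P', pctCutoff*100)
  MPGCutoffs <- mtcars %>%
    dplyr::group_by( gear ) %>%
    dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
  mtcars %>%
    dplyr::left_join( MPGCutoffs, by='gear' ) %>%
    dplyr::filter( mpg > !!rlang::sym(probColName) )
}

best_1 <- getBestMPG( 0.95 )   # filters on the P95 column
best_2 <- getBestMPG( 0.90 )   # filters on the P90 column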
Here's an alternate solution from Hadley's comment in the post referred to in MrFlick's answer (https://github.com/tidyverse/rlang/issues/116). Using as.name() from base R takes the place of rlang::sym(), and you still do need to unquote it. That is, the following also works:
dplyr::filter( mpg > !!as.name(probColName) )
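As a further aside (not from the linked issue): if you're on a dplyr version that supports the .data pronoun, you can index it with the string directly and skip building a symbol altogether:
dplyr::filter( mpg > .data[[probColName]] )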
Related
I'm trying to write a function to process multiple similar datasets. Here I want to subtract the scores a subject obtained in the previous interview from the scores the same subject obtained in the second interview. In all the datasets I want to process, the score of interest is stored in the second column. Writing this for one specific dataset is simple: just use the exact column name and everything goes fine.
d <- a %>%
  arrange(by_group = interview_date) %>%
  dplyr::group_by(subjectkey) %>%
  dplyr::mutate(score_change = colname_2nd - lag(colname_2nd))
But since I need a generic function that can process multiple datasets, I can't use an exact column name. So I tried 3 approaches, all of which altered only the last line.
Approach#1:
dplyr::mutate(score_change = dplyr::vars(2)-lag(dplyr::vars(2)))
Approach#2:
The second column name of each dataset of interest contains the same string, so I tried
dplyr::mutate(score_change = dplyr::vars(matches('string'))-lag(dplyr::vars(matches('string'))))
The error message for both of the above approaches is:
Error in dplyr::vars(2) - lag(dplyr::vars(2)) :
non-numeric argument to binary operator
Approach#3:
dplyr::mutate(score_change = .[[2]]-lag(.[[2]]))
Error message:
Error: Column `score_change` must be length 2 (the group size) or one, not 10880
10880 is the number of rows in my sample dataset, so it looks like group_by does not work in this approach.
Does anyone know how to make the function perform in the desired way?
If you want to use the position of the column, use cur_data()[[2]] to refer to the 2nd column of the data frame.
library(dplyr)

d <- a %>%
  arrange(interview_date) %>%
  dplyr::group_by(subjectkey) %>%
  dplyr::mutate(score_change = cur_data()[[2]] - lag(cur_data()[[2]]))
Also note that cur_data() doesn't count the grouping columns, so if subjectkey is the first column in your data and colname_2nd is the second one, you may need to use cur_data()[[1]] instead once you group_by.
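Here's a minimal runnable sketch with made-up data (the column names and values are assumptions based on the question):
library(dplyr)

a <- tibble(
  subjectkey     = c("s1", "s1", "s2", "s2"),
  score          = c(10, 15, 20, 18),
  interview_date = as.Date(c("2020-01-01", "2020-06-01",
                             "2020-01-01", "2020-06-01"))
)

a %>%
  arrange(interview_date) %>%
  group_by(subjectkey) %>%
  # subjectkey is a grouping column, so cur_data() sees score as column 1
  mutate(score_change = cur_data()[[1]] - lag(cur_data()[[1]]))
(On dplyr >= 1.1.0, pick(everything()) supersedes cur_data().)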
First of all, I have checked the existing topics. Unfortunately, they are either not exactly relevant or I am not able to understand them. As you can tell from the type of question, I'm VERY new to R. I hope this is okay...
I feel I am on the right track...
Here is an excerpt of the data frame (df): https://i.stack.imgur.com/5jv0m.jpg
I want to check whether the values of the subcategories of emissions (y) sum up to the values stated in the parent categories. Part of this is summing up the values of the subcategories.
In short, I want to know whether sum(3.B.1 + 3.B.2 + ... + 3.B.n) = 3.B (i.e., the sum stated in the CSV) for a given year and country. I want to verify the sums.
I've tried this code (with 2010 and Austria):
sum(compare_df, x4 %in% c("1.A.1", "1.A.2", "1.A.3", "1.A.4", "1.A.5") &
      x == "2010" & x2 == "Austria")
but get this:
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
After getting this working, is there a way to automate running the code for other conditions (i.e., a list of countries and years)? Some keywords would be helpful here; I could then search for them myself.
I hope my question is clear enough and thank you for any sort of help or suggestion. Sorry for such a long post...
PS: I've updated everything now and hope my question is clearer.
If you want to verify the sums of the y variable, you need to specify which variable you want to sum. Currently your sum statement is trying to sum the whole data.frame, and when it encounters a categorical variable it throws the error:
Error in FUN(X[[i]], ...) : only defined on a data frame with all
numeric variables
I didn't reproduce your code, but this can be verified with sum(iris). If you truly want to sum all numeric variables, you would have to do sum(iris[sapply(iris, is.numeric)]).
But to get to your question about subsetting on three variables you would have to do something like this:
sum(iris$Sepal.Length[iris$Species %in% c("setosa","versicolor") &
iris$Sepal.Width >= 3 &
iris$Petal.Length >= 2])
First you have to tell sum which data.frame and variable you want to sum over (the iris$Sepal.Length part of the code; this would be your df$y). Then, with [, you subset on the variables of interest. In your code, when you refer to variables without the df$ notation, R will not find them because they are not standalone objects but part of the data.frame. Hope this helps.
Also, in your post your year variable is numeric, not categorical, so you should remove the quotes around 2010.
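Applied to the layout described in the question (assuming x is the year, x2 the country, x4 the category code, and y the value; the 3.B.* codes below are stand-ins for whichever subcategories you're checking), that pattern would look like:
sum(compare_df$y[compare_df$x4 %in% c("3.B.1", "3.B.2") &
                   compare_df$x == 2010 &
                   compare_df$x2 == "Austria"])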
Hard to be sure without knowing what compare_df looks like but here is a possible solution using dplyr which is great for working with data frames.
The %>% operator is the 'pipe' which takes the results of the previous function and inserts them into the first argument of the subsequent function.
All of the dplyr functions (filter, group_by, summarize, etc) take the data as the first function argument so it works nicely with %>%.
library(dplyr)

compare_df %>%
  filter(x4 %in% c("1.A.1", "1.A.2", "1.A.3", "1.A.4", "1.A.5")) %>%
  group_by(x, x2) %>%
  summarize(sum_emissions = sum(y, na.rm = TRUE)) %>%
  filter(x == "2010", x2 == "Austria")
OK, so I cannot figure this out for the life of me. I want to filter my data based on a partial string match. Here is my data; I am just showing the column I want to filter, but there are more rows in the overall set. I only want to show the rows that begin with "CAO", which is easily achievable in the viewer.
(dataviewer screenshot omitted)
Basically I want the R code that would reproduce this exact result. I have tried using grepl like so:
filter(longdata, grepl("^CAO",longdata[,1]))
I have tried using subset:
subset(longdata, longdata[,1] == "^CAO")
I have tried subset with grepl, and no matter what I do I can't figure it out. I am new to R, so please explain it thoroughly.
The second argument of grepl wasn't recognized in your first code.
library(tidyverse) # in this case gives access to dplyr and to tibble's data_frame() function, which preserves the spaces in the column names
longdata <- data_frame(`Issue ID` = c("CAO-2017-20", "CAO-2017-20", "CAO-2017-20", "AO-2017-20", "CA-2017-20"))
longdata %>% filter(grepl("CAO", `Issue ID`)) # pattern "^CAO" also works
%>% is a piping operator that passes the outcome of the previous operation onward; here it's loaded via dplyr.
Basically what I did was to load the tidyverse set of packages (read more on tidyverse here). The ones of interest here are tibble and dplyr.
Then I created a sample data frame with tibble's data_frame() function.
Then I applied an adjusted version of the call you suggested, namely
filter(longdata, grepl("^CAO",`Issue ID`))
which is the same in its piped form:
longdata %>% filter(grepl("CAO", `Issue ID`))
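If you prefer base R, these should return the same rows for the sample data above (a sketch; your real column name may differ from the screenshot):
# startsWith() matches a leading substring without regex
longdata[startsWith(longdata$`Issue ID`, "CAO"), ]

# or subset() with grepl(), anchoring the pattern with ^
subset(longdata, grepl("^CAO", `Issue ID`))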
In source code like the following, which uses the aggregate function, I can't understand why we have to use the list() function here. I would rather replace it with just the one column that needs to be grouped by. And I don't know why we use the same subset, train[Sales != 0], twice. What if I used a different dataset as the second parameter? I think that makes a mistake fairly likely.
aggregate(train[Sales != 0]$Sales,
by = list(train[Sales != 0]$Store), mean)
Maybe someone will say this is a wrong use case. But I also saw this source code in the R documentation:
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)
Thanks for reading my question.
First of all, if you don't like the syntax of the aggregate function, you could take a look at the dplyr package. Its syntax might be a bit easier for you.
To answer your questions:
The second argument is just expected to be a list, so you can add multiple variables.
You have to use train[Sales != 0] twice, because otherwise the first argument and the by argument would refer to different rows. You could also make a subset first:
Base R-code:
trainSales <- train[Sales != 0]
aggregate( trainSales$Sales, by = list(trainSales$Store), mean )
With dplyr you could do something like this:
train %>%
  filter( Sales != 0 ) %>%
  group_by( Store ) %>%
  summarise_each( funs(mean) )
You see I use summarise_each because it condenses the dataset to one row per group, but you could of course also do something that leaves all the rows intact (in that case, use do).
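(Note that summarise_each() and funs() have since been deprecated; if you're on dplyr >= 1.0.0, the equivalent would use across():)
train %>%
  filter( Sales != 0 ) %>%
  group_by( Store ) %>%
  summarise( across(everything(), mean) )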
I have a dataset with a categorical variable that may take around 6 or 7 unique values. Depending upon which value it is, I need to run several functions, each of which differs depending on the value of the categorical variable.
I don't know how to go about programming this so that things are called correctly. Keep in mind this looks simple here, but my real scenario is much more complicated, with lots of sub-functions.
library(dplyr)
func1_value_one = function(multiplication_value){
  mtcars$check = "value_one"
  mtcars$mpg = mtcars$mpg * multiplication_value
  filter(mtcars, mpg > 60)
}
func0_value_zero = function(division_value){
  mtcars$check = "value_zero"
  mtcars$mpg = mtcars$mpg / division_value
  filter(mtcars, mpg < 3)
}
help_function = function(category_p, change_p){
  mtcars = return(filter(mtcars, vs == category_p))
  data = ifelse(category_p == 0, return(func0_value_zero(change_p)), return(func1_value_one(change_p)))
  return(data)
}
# I want to filter for the values that meet the parameter passed in and then perform the update on those values.
# Right now I am not able to both filter for the values and then perform the correct function call.
help_function(0,2)
So help_function(0, 20) would return only the rows from mtcars that have vs == 0, divide all the mpg values by 20, and create a new column called check with all values equal to 'value_zero'.
In the broader context, my dataset would use a flag to determine many different table join combinations and depending upon the data perform a variety of calculations and adjustments.
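One way this could be restructured (a sketch, not the only option; I've made the data an explicit argument): the return() wrapped inside the assignment exits help_function immediately, and ifelse() is a vectorized function rather than a control-flow construct, so a plain if/else does what you want here.
library(dplyr)

func0_value_zero = function(data, division_value){
  data$check = "value_zero"
  data$mpg   = data$mpg / division_value
  filter(data, mpg < 3)
}

func1_value_one = function(data, multiplication_value){
  data$check = "value_one"
  data$mpg   = data$mpg * multiplication_value
  filter(data, mpg > 60)
}

help_function = function(category_p, change_p){
  # filter first, then dispatch on the category flag
  subset_data = filter(mtcars, vs == category_p)
  if (category_p == 0) {
    func0_value_zero(subset_data, change_p)
  } else {
    func1_value_one(subset_data, change_p)
  }
}

help_function(0, 20)   # rows with vs == 0, mpg divided by 20, check == "value_zero"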