Variable not found in a function using dplyr functions - r

I tried for a few hours to create a function to get statistical information on quantitative variables.
Here is a small part of my data frame (which has many quantitative variables), called tabl_profil1:
  DateDiag   Age AgeDiag
     <dbl> <dbl>   <dbl>
1    1996.   43.     21.
2    2001.   53.     36.
3    2005.   75.     62.
4    1998.   62.     42.
5    2016.   53.     51.
6    2008.   65.     55.
I want to write a function that computes several statistics (mean, median, max, min, confidence interval) and returns the results in a new data frame.
I tried several approaches but always ran into problems.
function1 <- function(VarName){
  results <<- tabl_profil1 %>% summarise(Mean = mean(VarName))
}
function1(Age)
The error is:
Error in summarise_impl(.data, dots) :
Evaluation error: object 'Age' not found.
I also tried with tabl_profil1[[VarName]] in the function, but it doesn't work either.
Hope you can help me, and thanks in advance,
Pierre

This is non-standard evaluation. If you want to use bare column names as is typical in dplyr functions, you need to use enquo to create a quosure. Then, when you use that variable inside the dplyr verb, you need a !! in front of its name. Try this:
function1 <- function(VarName){
  var <- enquo(VarName)
  results <<- tabl_profil1 %>% summarise(Mean = mean(!!var))
}
function1(Age)
In response to discussion in comments: using <<- inside a function like this isn't a great idea for a few reasons. First, it means that you're defining a function that only acts on a specific data frame, in this case tabl_profil1, and returns results to only a specific variable, in this case by assigning back to results. This pretty much defeats the purpose of writing a function, which is to flexibly repeat an operation.
<<- used this way also isn't that safe, since you'll end up with a value stored in results that you might not know exactly where it came from. It's better to be able to say you called a function and returned output to a certain variable, and you can see in your code exactly where you did that.
Also, the advantage of the dplyr model is that you can operate on a data frame in a function and pipe that output along to the next function. You lose this by not having a data frame as the first argument.
A better way to structure this function would be like:
function1 <- function(df, VarName){
  var <- enquo(VarName)
  df %>% summarise(Mean = mean(!!var))
}
Now this function operates on any data frame you pass it, and returns the mean of whatever variable you name as the second argument. Now you can call something like:
mean_age <- function1(tabl_profil1, Age)
mean_height_from_other_tbl <- function1(other_table, Height)
This works on multiple data frames, and returns output that can be stored to whatever variable you want. Obviously I made up the second call as illustration.
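As an aside: with rlang 0.4 or later you can skip the explicit enquo()/!! pair and write the same function with the {{ }} ("curly-curly") shorthand, which does both steps at once. A minimal sketch, assuming the same tabl_profil1 as above:
library(dplyr)
function1 <- function(df, VarName){
  # {{ }} quotes the bare column name and splices it back in one step
  df %>% summarise(Mean = mean({{ VarName }}))
}
function1(tabl_profil1, Age)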

Related

Create multiple csv based on groups of a categorical feature in a dataframe in R with split and map2 functions

I have been trying to create a simple function with two arguments in R that takes a dataset and a categorical feature and, based on that feature, stores multiple csv files (grouped by the categories in that feature) in a folder ("DATA") inside the parent working directory.
The problem I have been facing: as simple as the function may be, I introduced non-standard evaluation with rlang, and multiple errors jump out around the enquo parameter (either a symbol is expected, or the argument is not a vector). The function therefore always fails.
The portion of code I used is the following, assuming everyone has a folder called "DATA" in the RStudio project to store the split csv files.
library(tidyverse)
library(data.table)
library(rlang)
csv_splitter <- function(df, parameter){
  df <- df
  # We set categorical features missing values vector, with names automatically applied
  # with sapply. We introduce enquo on the parameter for non-standard evaluation.
  categories <- df %>% select(where(is.character))
  NA_in_categories <- sapply(categories, FUN = function(x) {sum(is.na(x))})
  parameter <- enquo(c(parameter))
  # We make sure such parameter is included in the set of categorical features
  if (!!parameter %in% names(NA_in_categories)) {
    df %>%
      split(paste0(".$", !!parameter)) %>%
      map2(.y = names(.), ~ fwrite(.x, paste0('./DATA/data_dfparam_', .y, '.csv')))
    print("The csv's are stored now in your DATA folder")
  } else {
    print("your variable is not here or it is continuous, buddy, try another one")
  }
}
With an error of either "arg must be a symbol" on the enquo call, or of parameter not being a vector (which the c(parameter) wrapper was meant to solve), I am stuck and unable to find a change that fixes it.
If anyone does have a suggestion, I'll be more than happy to try it out on my code. In any case, I'll be extremely grateful for your help!
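For reference, both errors point at the enquo() call: it must be given the bare argument name (c(parameter) is a call, hence "arg must be a symbol"), and outside of quoting verbs the quosure should be converted to a string with rlang::as_name() rather than spliced with !!. A sketch of one possible rewrite, untested against the original data and still assuming the "DATA" folder exists:
library(tidyverse)
library(data.table)
library(rlang)
csv_splitter <- function(df, parameter){
  parameter <- enquo(parameter)          # quote the bare column name as given
  col <- as_name(parameter)              # get it back as a plain string
  categories <- df %>% select(where(is.character))
  if (col %in% names(categories)) {
    df %>%
      split(df[[col]]) %>%               # split() needs the column's values, not a pasted string
      iwalk(~ fwrite(.x, paste0("./DATA/data_dfparam_", .y, ".csv")))
    print("The csv's are stored now in your DATA folder")
  } else {
    print("your variable is not here or it is continuous, buddy, try another one")
  }
}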

R apply multiple functions when large number of categories/types are present using case_when (R vectorization)

Suppose I have a dataset of the following form:
City = c(1, 2, 2, 1)
Business = c(2, 1, 1, 2)
ExpectedRevenue = c(35, 20, 15, 19)
zz = data.frame(City, Business, ExpectedRevenue)
zz_new = do.call("rbind", replicate(zz, n = 30, simplify = FALSE))
My actual dataset contains about 200K rows. Furthermore, it contains information for over 100 cities.
Suppose, for each city (which I also call "Type"), I have the following functions which need to be applied:
#Writing the custom functions for the categories here
Type1=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
Type2=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
return(BusinessMax)
}
Once again, the above two functions are extremely simple ones that I use for illustration. The idea here is that for each City (or "Type") I need to run a different function for each row in my dataset. In the above two functions, I used rnorm in order to check and make sure that we are drawing different values for each row.
Now, for the entire dataset, I want to first divide the observations by City (or "Type"). I can do this using (zz_new[["City"]]==1) [also see below], and then run the respective function for each class. However, when I run the code below, I get -Inf.
Can someone help me understand why this is happening?
For the example data, I would expect to obtain 20 plus 10 times some random value (for Type =1) and 35 minus 100 times some random value (for Type=2). The values should also be different for each row since I am drawing them from a random normal distribution.
library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
  zz_new[["City"]]==1 ~ Type1(full_data=zz_new, observation=zz_new[,]),
  zz_new[["City"]]==2 ~ Type2(full_data=zz_new, observation=zz_new[,])
)
Thanks a lot in advance.
Let's take a look at your code.
First, I'll rewrite your code
library(dplyr)
zz_new[,"AdjustedRevenue"] = case_when(
  zz_new[["City"]]==1 ~ Type1(full_data=zz_new, observation=zz_new[,]),
  zz_new[["City"]]==2 ~ Type2(full_data=zz_new, observation=zz_new[,])
)
to
zz_new %>%
  mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new, zz_new),
                                     City == 2 ~ Type2(zz_new, zz_new)))
since you are using dplyr but don't use the powerful tools provided by this package.
Besides the usage of mutate, one key change is that I replaced zz_new[,] with zz_new. Now we see that both arguments of your Type functions are the same data frame.
Next step: Take a look at your function
Type1 <- function(full_data, observation){
  NewSet = full_data[which(!full_data$City == observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue) + 10*rnorm(1)
  return(BusinessMax)
}
which is called by Type1(zz_new,zz_new). So the definition of NewSet gives us
NewSet=full_data[which(!full_data$City==observation$City),]
# replace the arguments
NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
Thus NewSet is always a data frame with zero rows, and applying max to an empty column of a data frame yields -Inf (with a warning).
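One way out, as a sketch (not the only fix): compute each row's maximum ExpectedRevenue among the other cities explicitly, then let case_when choose between the two vectorised adjustments, with rnorm(n()) drawing a fresh value per row. For 200K rows you would want to precompute per-city maxima rather than the per-row sapply() shown here:
library(dplyr)
zz_new %>%
  mutate(
    # for each row, the max revenue among rows from *other* cities
    other_max = sapply(City, function(ct) max(ExpectedRevenue[City != ct])),
    # case_when picks elementwise, and rnorm(n()) gives one draw per row
    AdjustedRevenue = case_when(
      City == 1 ~ other_max + 10 * rnorm(n()),
      City == 2 ~ other_max - 100 * rnorm(n())
    )
  )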

R - Assign the mean of a column sub-sector to each row of that sub-sector

I am trying to create a column which has the mean of a variable according to sub-sectors of my data set. In this case, the mean is each state's crime rate calculated from county observations, which is then assigned to each county according to the state it is located in. Here is the code I wrote.
Create the new column
Data.Final$state_mean <- 0
Then calculate and assign the mean.
for (j in range[1:3136])
{
  state <- Data.Final[j, "state"]
  Data.Final[j, "state_mean"] <- mean(Data.Final$violent_crime_2009-2014,
                                      which(Data.Final[, "state"] == state))
}
Here is the following error
Error in range[1:3137] : object of type 'builtin' is not subsettable
Very much appreciated if you could take a few minutes to help a beginner out.
You've got a few problems:
range[1:3136] isn't valid syntax. range(1:3136) is valid syntax, but the range() function just returns the minimum and maximum. You don't need anything more than 1:3136; just use for (j in 1:3136) instead.
Because of the dash, violent_crime_2009-2014 isn't a standard column name. You'll need to refer to it with backticks, Data.Final$`violent_crime_2009-2014`, or in quotes with [[ or [: Data.Final[["violent_crime_2009-2014"]] or Data.Final[, "violent_crime_2009-2014"].
Also, your code is very inefficient: you re-calculate the mean on every single iteration. Try having a look at the Mean by Group R-FAQ. There are many faster and easier methods to get grouped means.
Without using extra packages, you could do
Data.Final$state_mean = ave(x = Data.Final[["violent_crime_2009-2014"]],
                            Data.Final$state,
                            FUN = mean)
For friendlier syntax and greater efficiency, the data.table and dplyr packages are popular. You can see examples using them at the link above.
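For instance, a dplyr version of the same grouped mean might look like this (a sketch, assuming the column really is named violent_crime_2009-2014, hence the backticks):
library(dplyr)
Data.Final <- Data.Final %>%
  group_by(state) %>%                                         # one group per state
  mutate(state_mean = mean(`violent_crime_2009-2014`)) %>%    # mean repeated on every row of the group
  ungroup()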
Here is one of many ways this can be done (I'm sure someone will post a tidyverse answer soon if not before I manage to post):
# Data for my example:
data(InsectSprays)
# Note I have a response column and a column I could subset on
str(InsectSprays)
# Take the averages with the by var:
mn <- with(InsectSprays,aggregate(x=list(mean=count),by=list(spray=spray),FUN=mean))
# Map the means back to your data using the by var as the key to map on:
InsectSprays <- merge(InsectSprays,mn,by="spray",all=TRUE)
Since you mentioned you're a beginner, I'll just note: avoid looping in R whenever you can, and vectorize your operations instead. The nice thing about using aggregate and merge is that you don't have to worry about mapping errors caused by an index shift while looping, where something weird happens silently.
Cheers!

Problems using dplyr in a function (group_by)

I want to use dplyr for some data manipulation. Background: I have a survey weight and a bunch of variables (mostly Likert items). I want to compute the frequencies and percentages per category, with and without the survey weight.
As an example, let us just use frequencies for the gender variable. The result should be this:
gender  freq  freq.weighted
     1   292       922.2906
     2   279       964.7551
     9     6        21.7338
I will do this for many variables, so I decided to put the dplyr code inside a function, so that I only have to change the variable and type less.
#exampledata
gender<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
survey_weight<-c(2.368456,2.642901,2.926698,3.628653,3.247463,3.698195,2.776772,2.972387,2.686365,2.441820,3.494899,3.133106,3.253514,3.138839,3.430597,3.769577,3.367952,2.265350,2.686365,3.189538,3.029999,3.024567,2.972387,2.730978,4.074495,2.921552,3.769577,2.730978,3.247463,3.230097)
test_dataframe<-data.frame(gender,survey_weight)
#function
weighting.function <- function(dataframe, variable){
  test_weighted <- dataframe %>%
    group_by_(variable) %>%
    summarise_(interp(freq = count(~weight)),
               interp(freq_weighted = sum(~weight)))
  return(test_weighted)
}
result_dataframe<-weighting.function(test_dataframe,"gender")
#this second step was left out in this example:
#mutate_(perc=interp(~freq/sum(~freq)*100),perc_weighted=interp(~freq_weighted/sum(~freq_weighted)*100))
This leads to the following Error-Message:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "formula"
I have tried a lot of different things. First, I used freq=n() to count the frequencies, but I always got an error (I checked that plyr was loaded before dplyr and not afterwards; it also didn't work).
Any ideas? I read the vignette on standard evaluation, but I always run into problems and have no idea what a solution could be.
I think you have a few nested mistakes which are causing you problems. The biggest one is using count() instead of summarise(). I'm guessing you wanted n():
weighting.function <- function(dataframe, variable){
  dataframe %>%
    group_by_(variable) %>%
    summarise_(
      freq = ~n(),
      freq_weighted = ~sum(survey_weight)
    )
}
weighting.function(test_dataframe, ~gender)
You also had a few unneeded uses of interp(). If you do use interp(), the call should look like freq = interp(~n()), i.e. the name is outside the call to interp, and the thing being interpolated starts with ~.
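As an aside for readers on current dplyr: the underscore verbs (group_by_(), summarise_()) have since been deprecated. A sketch of the same function with the {{ }} shorthand (rlang >= 0.4), assuming the test_dataframe above with a numeric survey_weight:
weighting.function <- function(dataframe, variable){
  dataframe %>%
    group_by({{ variable }}) %>%   # {{ }} replaces the enquo/!! dance
    summarise(
      freq = n(),
      freq_weighted = sum(survey_weight)
    )
}
weighting.function(test_dataframe, gender)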

Extract Group Regression Coefficients in R w/ PLYR

I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
  lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
  lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
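For example, a corrected version might look like this (a sketch; it also moves the variables into lm's data argument, so each zipcode's chunk is actually the data being fit):
x <- function(df) {
  # df is the zipcode-specific chunk that dlply passes in
  lm(hhincome ~ square_footage, data = df)
}
regressions <- dlply(Sealed, .(zipcode), x)
coefficients <- ldply(regressions, coef)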
The issue is not with plyr but rather in the definition of the function: it never does anything with its argument.
As an analogy,
myFun <- function(x) {
  3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is: there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable, since $hhincome and $square_footage look like they should serve as the varying pieces.
But you want your x to vary over what comes before the $. As @Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and the same for $square_footage) and that will help.
