Problems using dplyr in a function (group_by) - r

I want to use dplyr for some data manipulation. Background: I have a survey weight and a bunch of variables (mostly likert-items). I want to sum the frequencies and percentages per category with and without survey weight.
As an example, let us just use frequencies for the gender variable. The result should be this:
gender freq freq.weighted
1 292 922.2906
2 279 964.7551
9 6 21.7338
I will do this for many variables, so I decided to put the dplyr code inside a function. That way I only have to change the variable name and type less.
#exampledata
gender<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
survey_weight<-c("2.368456","2.642901","2.926698","3.628653","3.247463","3.698195","2.776772","2.972387","2.686365","2.441820","3.494899","3.133106","3.253514","3.138839","3.430597","3.769577","3.367952","2.265350","2.686365","3.189538","3.029999","3.024567","2.972387","2.730978","4.074495","2.921552","3.769577","2.730978","3.247463","3.230097")
test_dataframe<-data.frame(gender,survey_weight)
#function
weighting.function<-function(dataframe,variable){
test_weighted<- dataframe %>%
group_by_(variable) %>%
summarise_(interp(freq=count(~weight)),
interp(freq_weighted=sum(~weight)))
return(test_weighted)
}
result_dataframe<-weighting.function(test_dataframe,"gender")
#this second step was left out in this example:
#mutate_(perc=interp(~freq/sum(~freq)*100),perc_weighted=interp(~freq_weighted/sum(~freq_weighted)*100))
This leads to the following Error-Message:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "formula"
I have tried a lot of different things. First, I used freq=n() to count the frequencies, but I always got an error (I checked that plyr was loaded before dplyr and not afterwards; it also didn't work).
Any ideas? I read the vignette on standard evaluation, but I always run into problems and have no idea what a solution could be.

I think you have a few nested mistakes which are causing you problems. The biggest one is using count() inside summarise_(). I'm guessing you wanted n():
weighting.function <- function(dataframe, variable){
dataframe %>%
group_by_(variable) %>%
summarise_(
freq = ~n(),
freq_weighted = ~sum(survey_weight)
)
}
weighting.function(test_dataframe, ~gender)
You also had a few unneeded uses of interp(). If you do use interp(), the call should look like freq = interp(~n()), i.e. the name is outside the call to interp, and the thing being interpolated starts with ~.
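For readers on a current dplyr, where the underscore verbs (group_by_(), summarise_()) are deprecated, the same idea can be sketched with the embrace operator. This is a minimal sketch, not the answer's original code: it assumes dplyr 1.0.0 or later, the function name weighted_freq and the small data frame are made up for illustration, and the weight column is assumed to be numeric (the example data above stores it as strings, which sum() cannot add).

```r
library(dplyr)

# Hypothetical helper: frequencies and weighted frequencies per category.
# `variable` and `weight` are passed as bare column names.
weighted_freq <- function(data, variable, weight) {
  data %>%
    group_by({{ variable }}) %>%
    summarise(
      freq = n(),                          # unweighted count per group
      freq_weighted = sum({{ weight }}),   # sum of survey weights per group
      .groups = "drop"
    )
}

# Toy data (made up), with numeric weights
toy <- data.frame(
  gender = c("1", "2", "2"),
  survey_weight = c(2.5, 3.0, 1.5)
)

res <- weighted_freq(toy, gender, survey_weight)
print(res)
```

The call site then uses bare column names, weighted_freq(toy, gender, survey_weight), rather than a formula such as ~gender.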

Related

Create multiple csv based on groups of a categorical feature in a dataframe in R with split and map2 functions

I have been trying to write a simple two-argument function in R that takes a dataset and a categorical feature and, based on that feature, writes one csv file per category into a folder ("DATA") inside the working directory.
The problem, as simple as the function may be: I introduced non-standard evaluation with rlang, but I keep getting errors from the enquo call (either that a symbol is expected or that the argument is not a vector). As a result, the function always fails.
The code I used is the following, assuming there is a folder called "DATA" in the RStudio project to store the split csv files.
library(tidyverse)
library(data.table)
library(rlang)
csv_splitter <- function(df, parameter){
df <- df
# We set categorical features missing values vector, with names automatically applied with
# sapply. We introduce enquo on the parameter for non-standard evaluation.
categories <- df %>% select(where(is.character))
NA_in_categories <- sapply(categories, FUN = function(x) {sum(is.na(x))})
parameter <- enquo(c(parameter))
#We make sure such parameter is included in the set of categorical features
if (!!parameter %in% names(NA_in_categories)) {
df %>%
split(paste0(".$", !!parameter)) %>%
map2(.y = names(.), ~ fwrite(.x, paste0('./DATA/data_dfparam_', .y, '.csv')))
print("The csv's are stored now in your DATA folder")
} else {
print("your variable is not here or it is continuous, buddy, try another one")
}
}
I get either an "arg must be a symbol" error from enquo or an error about parameter not being a vector (which the c(parameter) in this code was meant to work around), and I am stuck and unable to find a change that solves it.
If anyone does have a suggestion, I'll be more than happy to try it out on my code. In any case, I'll be extremely grateful for your help!
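No answer is recorded for this question here, so the following is only a rough sketch of one way the function could be made to run, not a confirmed fix. It assumes the column is passed as a bare name or a string, captures it with rlang::ensym() instead of enquo(c(...)), and splits on the column values rather than on a pasted ".$name" string. The out_dir argument, the file name pattern and the toy data are assumptions made for illustration, and base write.csv stands in for fwrite.

```r
library(purrr)

# Sketch: write one csv per level of a categorical column.
csv_splitter <- function(df, parameter, out_dir = "DATA") {
  # Capture the column name as a plain string (works for bare names and strings)
  col <- rlang::as_string(rlang::ensym(parameter))

  # Names of the character/factor columns in the data
  is_categorical <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))
  if (!col %in% names(df)[is_categorical]) {
    message("Column '", col, "' is missing or not categorical; nothing written.")
    return(invisible(character(0)))
  }

  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  pieces <- split(df, df[[col]])   # one data frame per category
  paths <- file.path(out_dir, paste0("data_", col, "_", names(pieces), ".csv"))
  walk2(pieces, paths, ~ utils::write.csv(.x, .y, row.names = FALSE))
  invisible(paths)
}

# Usage on made-up data, writing into a temporary folder
toy <- data.frame(species = c("a", "b", "a"), value = 1:3, stringsAsFactors = FALSE)
written <- csv_splitter(toy, species, out_dir = file.path(tempdir(), "DATA"))
```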

R apply multiple functions when large number of categories/types are present using case_when (R vectorization)

Suppose I have a dataset of the following form:
City=c(1,2,2,1)
Business=c(2,1,1,2)
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
My actual dataset contains about 200K rows. Furthermore, it contains information for over 100 cities.
Suppose, for each city (which I also call "Type"), I have the following functions which need to be applied:
#Writing the custom functions for the categories here
Type1=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
Type2=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
return(BusinessMax)
}
Once again, the above two functions are extremely simple ones that I use for illustration. The idea here is that for each City (or "Type") I need to run a different function for each row in my dataset. In the above two functions, I used rnorm in order to check that we are drawing different values for each row.
Now, for the entire dataset, I want to first divide the observations by City (or "Type"). I can do this using (zz_new[["City"]]==1) [also see below]. Then I want to run the respective function for each class. However, when I run the code below, I get -Inf.
Can someone help me understand why this is happening?
For the example data, I would expect to obtain 20 plus 10 times some random value (for Type =1) and 35 minus 100 times some random value (for Type=2). The values should also be different for each row since I am drawing them from a random normal distribution.
library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
Thanks a lot in advance.
Let's take a look at your code.
I rewrite your code
library(dplyr)
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
to
zz_new %>%
mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new,zz_new),
City == 2 ~ Type2(zz_new,zz_new)))
since you are using dplyr but don't use the powerful tools provided by this package.
Besides the usage of mutate one key change is that I replaced zz_new[,] with zz_new. Now we see that both arguments of your Type-functions are the same dataframe.
Next step: Take a look at your function
Type1 <- function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
which is called by Type1(zz_new,zz_new). So the definition of NewSet gives us
NewSet=full_data[which(!full_data$City==observation$City),]
# replace the arguments
NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
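Since zz_new$City == zz_new$City is TRUE for every row, its negation is never TRUE, so NewSet is always a data frame with zero rows. Applying max to an empty column of a data.frame yields -Inf (with a warning).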
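The -Inf can be reproduced in isolation, and one possible way to get the per-row values the question expects is to compute the "maximum over other cities" as a vector first and only then apply case_when. This is a sketch under assumptions rather than part of the answer: the helper max_other_city and the column MaxOther are names invented here, and it uses the four-row zz data from the question.

```r
library(dplyr)

# max() over an empty vector is where the -Inf comes from
empty_max <- suppressWarnings(max(numeric(0)))
print(empty_max)

zz <- data.frame(
  City = c(1, 2, 2, 1),
  Business = c(2, 1, 1, 2),
  ExpectedRevenue = c(35, 20, 15, 19)
)

# Hypothetical helper: for each row, the largest revenue among rows of other cities
max_other_city <- function(city, revenue) {
  vapply(city, function(ci) max(revenue[city != ci]), numeric(1))
}

set.seed(1)
zz_adj <- zz %>%
  mutate(
    MaxOther = max_other_city(City, ExpectedRevenue),
    # rnorm(n()) draws one value per row, so every row gets its own noise
    AdjustedRevenue = case_when(
      City == 1 ~ MaxOther + 10 * rnorm(n()),
      City == 2 ~ MaxOther - 100 * rnorm(n())
    )
  )
print(zz_adj)
```

Because case_when evaluates every right-hand side for all rows, each right-hand side has to be a vector with one element per row (or a single value), which is why the noise is drawn with rnorm(n()) rather than rnorm(1).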

Variable not found in a function using dplyr functions

I have spent a few hours trying to write a function that computes summary statistics for quantitative variables.
Here is a small part of my data frame, called tabl_profil1, which has many quantitative variables:
DateDiag Age AgeDiag
<dbl> <dbl> <dbl>
1 1996. 43. 21.
2 2001. 53. 36.
3 2005. 75. 62.
4 1998. 62. 42.
5 2016. 53. 51.
6 2008. 65. 55.
I want to write a function that computes several statistics (mean, median, max, min, confidence interval) and returns the results in a new data frame.
I tried several approaches but always ran into problems.
function1 <- function(VarName){results <<- tabl_profil1 %>% summarise(Mean = mean(VarName))}
function1(Age)
The error is:
Error in summarise_impl(.data, dots) :
Evaluation error: object 'Age' not found.
I also tried tabl_profil1[[VarName]] inside the function, but it doesn't work either.
I hope you can help me. Thanks in advance,
Pierre
This is non-standard evaluation. If you want to use bare column names as is typical in dplyr functions, you need to use enquo to create a quosure. Then when you call that variable, you need a !! in front of its name. Try this:
function1 <- function(VarName){
var <- enquo(VarName)
results <<- tabl_profil1 %>% summarise(Mean = mean(!!var))
}
function1(Age)
In response to discussion in comments: using <<- inside a function like this isn't a great idea for a few reasons. First, it means that you're defining a function that only acts on a specific data frame, in this case tabl_profil1, and returns results to only a specific variable, in this case by assigning back to results. This pretty much defeats the purpose of writing a function, which is to flexibly repeat an operation.
<<- used this way also isn't that safe, since you'll end up with a value stored in results that you might not know exactly where it came from. It's better to be able to say you called a function and returned output to a certain variable, and you can see in your code exactly where you did that.
Also, the advantage of the dplyr model is that you can operate on a data frame in a function and pipe that output along to the next function. You lose this by not having a data frame as the first argument.
A better way to structure this function would be like:
function1 <- function(df, VarName){
var <- enquo(VarName)
df %>% summarise(Mean = mean(!!var))
}
Now this function operates on any data frame you pass it, and returns a one-row data frame holding the mean of whichever variable you give as the second argument. Now you can call something like:
mean_age <- function1(tabl_profil1, Age)
mean_height_from_other_tbl <- function1(other_table, Height)
This works on multiple data frames, and returns output that can be stored to whatever variable you want. Obviously I made up the second call as illustration.
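The same enquo and !! pattern extends to the other statistics the question asks for. The following is a minimal sketch, assuming a numeric column; the function name describe_var, the na.rm = TRUE choice and the toy data frame (built from the Age values shown in the question) are illustrative assumptions, and the confidence interval is left out.

```r
library(dplyr)

# Hypothetical helper: several summary statistics for one bare column name
describe_var <- function(df, VarName) {
  var <- enquo(VarName)
  df %>%
    summarise(
      Mean   = mean(!!var, na.rm = TRUE),
      Median = median(!!var, na.rm = TRUE),
      Min    = min(!!var, na.rm = TRUE),
      Max    = max(!!var, na.rm = TRUE)
    )
}

# Age values copied from the sample rows in the question
toy <- data.frame(Age = c(43, 53, 75, 62, 53, 65))
age_stats <- describe_var(toy, Age)
print(age_stats)
```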

R - Assign the mean of a column sub-sector to each row of that sub-sector

I am trying to create a column that holds the mean of a variable within subsets of my data set. In this case, the mean is the crime rate of each state calculated from county observations, which is then assigned to each county according to the state it is located in. Here is the code I wrote.
Create the new column
Data.Final$state_mean <- 0
Then calculate and assign the mean.
for (j in range[1:3136])
{
state <- Data.Final[j, "state"]
Data.Final[j, "state_mean"] <- mean(Data.Final$violent_crime_2009-2014,
which(Data.Final[, "state"] == state))
}
Here is the error I get:
Error in range[1:3137] : object of type 'builtin' is not subsettable
I would very much appreciate it if you could take a few minutes to help a beginner out.
You've got a few problems:
range[1:3136] isn't valid syntax. range(1:3136) is valid syntax, but the range() function just returns the minimum and maximum. You don't need anything more than 1:3136, so just use for (j in 1:3136) instead.
Because of the dash, violent_crime_2009-2014 isn't a standard column name. You'll need to put it in backticks, Data.Final$`violent_crime_2009-2014`, or in quotes with [[ or [: Data.Final[["violent_crime_2009-2014"]] or Data.Final[, "violent_crime_2009-2014"].
Also, your code is very inefficient: you re-calculate the mean on every single iteration. Try having a look at the Mean by Group R-FAQ. There are many faster and easier methods to get grouped means.
Without using extra packages, you could do
Data.Final$state_mean = ave(x = Data.Final[["violent_crime_2009-2014"]],
Data.Final$state,
FUN = mean)
For friendlier syntax and greater efficiency, the data.table and dplyr packages are popular. You can see examples using them at the link above.
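As an illustration of the dplyr route mentioned above, a grouped mutate assigns each row the mean of its own group. This is a sketch on made-up data, not the asker's Data.Final; it assumes the column names from the question, including the non-standard name that needs backticks.

```r
library(dplyr)

# Toy stand-in for Data.Final; check.names = FALSE keeps the dash in the column name
counties <- data.frame(
  state = c("A", "A", "B"),
  `violent_crime_2009-2014` = c(10, 20, 30),
  check.names = FALSE
)

counties <- counties %>%
  group_by(state) %>%
  mutate(state_mean = mean(`violent_crime_2009-2014`)) %>%  # mean within each state
  ungroup()

print(counties)
```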
Here is one of many ways this can be done (I'm sure someone will post a tidyverse answer soon if not before I manage to post):
# Data for my example:
data(InsectSprays)
# Note I have a response column and a column I could subset on
str(InsectSprays)
# Take the averages with the by var:
mn <- with(InsectSprays,aggregate(x=list(mean=count),by=list(spray=spray),FUN=mean))
# Map the means back to your data using the by var as the key to map on:
InsectSprays <- merge(InsectSprays,mn,by="spray",all=TRUE)
Since you mentioned you're a beginner, I'll just mention that whenever you can, avoid looping in R and vectorize your operations instead. The nice thing about using aggregate and merge is that you don't have to worry about mapping errors caused by an index shift while looping.
Cheers!

Summarise_each and dplyr syntax

I've been given a set of particularly messy data. In it there were three columns denoting the same factor variable - focus1, focus2, and focus3 where each observation of the data could contain more than one focus yet they are not a measure of magnitude i.e. the focus given in focus1 is not necessarily a stronger focus than that of focus2. I need to expand these three variables to indicator variables for each possible level of a consolidated focus variable. To do this, I used the code below, and it worked perfectly on my PC yesterday but I work on a mac in my office and I am now running into problems.
# Create focus variables
spr.focus<- y1 %>%
gather(foc_num, focus, starts_with("focus")) %>%
mutate(present = 1) %>%
spread(focus, present, fill = 0)
# Reorder data on ID var while removing unnecessary columns
spr.focus <- spr.focus[order(spr.focus$tid), -c(34, 54)]
# Group by ID var and summarise indicator variables to get one obs per ID
focusvars <- spr.focus %>%
group_by(tid) %>% # tid is id var
summarise_each(funs(sum), Arts:Unclear)
I have run into two problems:
summarise_each appears to have been made obsolete on Mac and not Windows? The answer here appears to be to use summarise_at. Can I use the same x:y notation for denoting the columns to summarise? This is important because there are around 20-30 columns between the first and last index.
For some reason R no longer recognises column names I refer to within the pipe notation. I get an error "Error in eval_bare(dot$expr, dot$env) : object 'Arts' not found".
I'm also quite curious, what is causing these disparities between operating on Windows and Mac? I have to imagine it is different versions of packages/RStudio itself but it is creating quite a conundrum.
After some tinkering with summarize_at, I found my solution:
focusvars <- spr.focus %>%
group_by(tid) %>% # tid is an id var
summarise_at(vars(Arts:Unclear),funs(sum))
For some reason, the editor still flags errors in the margin saying it cannot find the column names in scope, but the code runs and creates the new data frame. I'll leave this up in case it is helpful to others.
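On the version question: summarise_each() was deprecated in dplyr 0.7.0 and funs() in dplyr 0.8.0, so different dplyr versions on the two machines would be consistent with the behaviour described. In dplyr 1.0.0 and later the same summary can be written with across(), which accepts the same x:y column range. The sketch below runs on made-up data; the column Sports and the values are assumptions standing in for the 20 to 30 real indicator columns.

```r
library(dplyr)

# Toy stand-in for spr.focus: one row per (tid, focus) with 0/1 indicators
toy_focus <- data.frame(
  tid     = c(1, 1, 2),
  Arts    = c(1, 0, 0),
  Sports  = c(0, 1, 0),
  Unclear = c(0, 0, 1)
)

focus_sums <- toy_focus %>%
  group_by(tid) %>%
  summarise(across(Arts:Unclear, sum), .groups = "drop")  # sum every column in the range

print(focus_sums)
```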
