I have been trying to write a simple two-argument function in R that takes a dataset and a categorical feature and, based on that feature, stores multiple csv files (one per category) in a folder ("DATA") inside the parent working directory.
The problem is that, simple as the function may be, it always fails: I introduced non-standard evaluation with rlang, but the enquo'd parameter triggers one error after another (either a symbol was expected, or the result is not a vector).
The code I used is below; it assumes everyone has a folder called "DATA" in the RStudio project to store the split csv files.
library(tidyverse)
library(data.table)
library(rlang)
csv_splitter <- function(df, parameter){
  df <- df
  # We set categorical features missing values vector, with names automatically applied with
  # sapply. We introduce enquo on the parameter for non-standard evaluation.
  categories <- df %>% select(where(is.character))
  NA_in_categories <- sapply(categories, FUN = function(x) {sum(is.na(x))})
  parameter <- enquo(c(parameter))
  # We make sure such parameter is included in the set of categorical features
  if (!!parameter %in% names(NA_in_categories)) {
    df %>%
      split(paste0(".$", !!parameter)) %>%
      map2(.y = names(.), ~ fwrite(.x, paste0('./DATA/data_dfparam_', .y, '.csv')))
    print("The csv's are stored now in your DATA folder")
  } else {
    print("your variable is not here or it is continuous, buddy, try another one")
  }
}
I am stuck: the error is either "arg must be a symbol" at the enquo call, or the parameter not being a vector (which the c(parameter) wrapper was meant to solve), and I cannot find any further change that fixes it.
If anyone does have a suggestion, I'll be more than happy to try it out on my code. In any case, I'll be extremely grateful for your help!
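For reference, here is a minimal sketch of one way the function can work under rlang's tidy-eval rules: enquo() must capture the bare argument itself (not c(parameter)), the captured quosure is turned into a string with as_name() for the membership test, and pull() extracts the column used for the split. Everything else follows the original structure.

```r
library(dplyr)
library(purrr)
library(rlang)
library(data.table)

csv_splitter <- function(df, parameter) {
  parameter <- enquo(parameter)      # capture the bare column name
  param_name <- as_name(parameter)   # its name as a string

  # Only character columns qualify as categorical features here
  categories <- df %>% select(where(is.character))

  if (param_name %in% names(categories)) {
    df %>%
      split(pull(df, !!parameter)) %>%
      iwalk(~ fwrite(.x, paste0("./DATA/data_dfparam_", .y, ".csv")))
    message("The csv's are stored now in your DATA folder")
    invisible(TRUE)
  } else {
    message("your variable is not here or it is continuous, buddy, try another one")
    invisible(FALSE)
  }
}
```

It is called with a bare column name, e.g. csv_splitter(my_df, some_character_column), and it still assumes a DATA folder already exists in the working directory.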
I just came across a question in which a data frame variable that was not part of the data frame was selected. I marked that option as incorrect, but it turned out to be correct. Please help me understand how this works.
I cross-verified the code by running it in the R console, and it ran fine.
df <- data.frame(x = 1:10)
df %>% mutate(xy = paste(x,df$isItPossible))
In my view, the statement should throw an error, but it runs correctly. "isItPossible" is not a variable available in df.
When you run
df$isItPossible
it doesn't return an error, it returns NULL. This type of stuff is allowed so that you can create new columns with
df$isItPossible <- "Yes"
And the paste function doesn't have a problem with NULL values. It just ignores them.
paste("x", NULL)
# [1] "x "
But when using mutate, you really shouldn't use the df$ part. It's meant to be run as
df %>% mutate(xy = paste(x, isItPossible))
which would give you an error about the value not being found, which is what you want.
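Putting the three behaviours side by side (the column name missing_col below is a made-up stand-in):

```r
library(dplyr)

df <- data.frame(x = 1:3)

# $-indexing a column that doesn't exist is silent: it returns NULL
is.null(df$missing_col)          # TRUE

# paste() drops zero-length arguments, keeping only the separator
paste("x", NULL)                 # "x "

# A bare name inside mutate() is looked up strictly and errors out
res <- tryCatch(
  df %>% mutate(xy = paste(x, missing_col)),
  error = function(e) "not found"
)
res                              # "not found"
```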
The object 'customer_profiling_vars' is a data frame holding just the variables selected by a clustering algorithm (RSKC), as seen in the R output below the code:
customer_profiling_vars
customer_profiling_vars$Variables
Now, I want to select from my dataset sc_df_tr_dummified only the variables in that vector from the data frame 'customer_profiling_vars', using dplyr's select:
customer_df_interprete = sc_df_tr_dummified %>%
select(customer_profiling_vars$Variables)
glimpse(customer_df_interprete)
I expect to get the variable 'SalePrice' selected.
However, some other variable ('PoolArea.576') gets selected, which is very weird:
Just to be sure, I tried using "SalePrice" directly instead of customer_profiling_vars$Variables, and it gives what I intended:
What is wrong with dplyr's select? To me, it seems like it has something to do with the factor nature of 'customer_profiling_vars$Variables':
Thanks in advance!
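That factor is indeed the likely culprit: a factor stores its values as integer level codes, and when select() receives one it can end up indexing columns by those codes rather than by name, so a level code of 576 picks the 576th column (here 'PoolArea.576'). Converting to character first, and wrapping in all_of(), makes the intent unambiguous. A minimal sketch with made-up stand-in columns:

```r
library(dplyr)

df <- data.frame(PoolArea = 0, SalePrice = 100)

# A factor carries integer level codes under the hood:
vars <- factor("SalePrice", levels = c("PoolArea", "SalePrice"))
as.integer(vars)    # 2 -- the code, not the name

# Convert to character so select() matches by name, not by position
out <- df %>% select(all_of(as.character(vars)))
names(out)          # "SalePrice"
```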
I'm trying to find all rows where values exist between a top and bottom depth value in Azure ML. I'm using dplyr's filter function, and the code doesn't throw an error. But when I look at the results it hasn't filtered anything. Can somebody see where I'm going wrong?
library(dplyr)
Upper_Depth<-dataset1$Upper_Depth
Lower_Depth<-dataset1$Lower_Depth
TopMD<-dataset1$TopMD
BaseMD<-dataset1$BaseMD
# Filter where the Perf is within the Upper and Lower Depth intervals:
#select(Upper_Depth, Lower_Depth, TopMD, BaseMD) %>%
filter(dataset1, Upper_Depth > TopMD & Lower_Depth < BaseMD);
# Subset the data where the perfs are in the L_WSEC_A:
#subset(dataset1, Upper_Depth > TopMD & Lower_Depth < BaseMD)
full_data <- dataset1
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("full_data")
I tried the subset function but get the same result. I apologize because I'm very new to r and Azure ML Studio.
Welcome to R!
The reason your code is not working as expected is that the first four lines in your script assign vectors. That would work fine if you were using base R subsetting (try ?'[' at the console) and performing logical tests on the columns as vectors.
dplyr works somewhat differently. The metaphor is closer to SQL, treating each "column" in the dataset as a SQL field. So, you can work with your variables directly, without subsetting them out into vectors.
Try:
dataset1 %>% filter(Upper_Depth > TopMD & Lower_Depth < BaseMD)
It should give you your dataset, filtered to the rows where both conditions evaluate to TRUE.
If you want to save the results, just left-assign them like this:
full_data <- dataset1 %>% filter(Upper_Depth > TopMD & Lower_Depth < BaseMD)
I hope this helps!
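A self-contained sketch of the suggested fix (the dataset1 below is hypothetical stand-in data, since the original lives inside Azure ML Studio):

```r
library(dplyr)

dataset1 <- data.frame(
  Upper_Depth = c(5, 12, 20),
  Lower_Depth = c(8, 15, 22),
  TopMD       = c(4, 14, 18),
  BaseMD      = c(10, 16, 25)
)

# Keep only rows where the perf sits inside the depth interval,
# and assign the result so the filtering is not thrown away
full_data <- dataset1 %>%
  filter(Upper_Depth > TopMD & Lower_Depth < BaseMD)

nrow(full_data)   # 2 -- the middle row fails Upper_Depth > TopMD
```

In Azure ML Studio, maml.mapOutputPort("full_data") would then export the filtered frame rather than the untouched one.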
I want to use dplyr for some data manipulation. Background: I have a survey weight and a bunch of variables (mostly likert-items). I want to sum the frequencies and percentages per category with and without survey weight.
As an example, let us just use frequencies for the gender variable. The result should be this:
gender freq freq.weighted
1 292 922.2906
2 279 964.7551
9 6 21.7338
I will do this for many variables, so I decided to put the dplyr code inside a function so I only have to change the variable and type less.
#exampledata
gender<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
survey_weight<-c(2.368456,2.642901,2.926698,3.628653,3.247463,3.698195,2.776772,2.972387,2.686365,2.441820,3.494899,3.133106,3.253514,3.138839,3.430597,3.769577,3.367952,2.265350,2.686365,3.189538,3.029999,3.024567,2.972387,2.730978,4.074495,2.921552,3.769577,2.730978,3.247463,3.230097)
test_dataframe<-data.frame(gender,survey_weight)
#function
weighting.function <- function(dataframe, variable){
  test_weighted <- dataframe %>%
    group_by_(variable) %>%
    summarise_(interp(freq = count(~weight)),
               interp(freq_weighted = sum(~weight)))
  return(test_weighted)
}
result_dataframe<-weighting.function(test_dataframe,"gender")
#this second step was left out in this example:
#mutate_(perc=interp(~freq/sum(~freq)*100),perc_weighted=interp(~freq_weighted/sum(~freq_weighted)*100))
This leads to the following Error-Message:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "formula"
I have tried a lot of different things. First, I used freq=n() to count the frequencies, but I always got an error (I checked that plyr was loaded before dplyr and not afterwards; that also didn't work).
Any ideas? I read the vignette on standard evaluation, but I always run into problems and have no idea what a solution could be.
I think you have a few nested mistakes which are causing you problems. The biggest one is using count() inside summarise_(); I'm guessing you wanted n():
weighting.function <- function(dataframe, variable){
  dataframe %>%
    group_by_(variable) %>%
    summarise_(
      freq = ~n(),
      freq_weighted = ~sum(survey_weight)
    )
}
weighting.function(test_dataframe, ~gender)
You also had a few unneeded uses of interp(). If you do use interp(), the call should look like freq = interp(~n()), i.e. the name is outside the call to interp, and the thing being interpolated starts with ~.
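Note that group_by_() and summarise_() have since been deprecated in dplyr; with current versions the same function can be written with the embrace operator, {{ }}. A sketch assuming dplyr >= 1.0 and a numeric survey_weight column:

```r
library(dplyr)

weighting_function <- function(dataframe, variable) {
  dataframe %>%
    group_by({{ variable }}) %>%       # embrace the bare column argument
    summarise(
      freq          = n(),
      freq_weighted = sum(survey_weight),
      .groups = "drop"
    )
}

# Small stand-in data to show the shape of the result
test_dataframe <- data.frame(
  gender        = c("1", "2", "2", "1"),
  survey_weight = c(1.5, 2.0, 2.5, 0.5)
)

weighting_function(test_dataframe, gender)
# gender "1": freq 2, freq_weighted 2.0
# gender "2": freq 2, freq_weighted 4.5
```

The function is then called with a bare column name, weighting_function(test_dataframe, gender), instead of a string or a formula.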