R and dplyr: how to use ifelse condition with an external dataframe

I'm using dplyr and Spark to create a new variable with the mutate command. This new variable, new_variable, is categorical and must be "ALFA" if the value of the variable my_data_variable is included in a column of another dataframe, other_df$one_column, and "BETA" if it is not.
An example of what I did:
my_data %>%
mutate(new_variable = ifelse(my_data_variable == other_df$one_column, "ALFA","BETA"))
But unfortunately I get the error below. Even using !!other_df$one_column or local(other_df[['one_column']]) instead of other_df$one_column does not work.
Error: Cannot embed a data frame in a SQL query.
If you are seeing this error in code that used to work, the most likely cause is a change in dbplyr 1.4.0. Previously `df$x` or
`df[[y]]` implied that `df` was a local variable, but now you must make that explicit with `!!` or `local()`, e.g., `!!df$x` or
`local(df[["y"]])`
Are there alternative methods to the ifelse function to get the expected result?

Thanks @RonakShah for his help. The solution is the following:
my_data %>%
mutate(new_variable = ifelse(my_data_variable %in% !!other_df$one_column, "ALFA","BETA"))
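The difference between == and %in% is easy to see with plain vectors. A minimal base-R sketch (vals and x are made-up stand-ins for other_df$one_column and my_data_variable):

```r
vals <- c("a", "b")          # stands in for other_df$one_column
x <- c("a", "c", "b", "a")   # stands in for my_data_variable

# `==` compares element-wise with recycling, which is not a membership test;
# `%in%` checks whether each element of x appears anywhere in vals.
ifelse(x %in% vals, "ALFA", "BETA")
# [1] "ALFA" "BETA" "ALFA" "ALFA"
```

Because %in% collapses to a single logical vector the same length as x, it also translates cleanly to SQL (an IN clause) when the table lives in Spark.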

Related

Create multiple csv based on groups of a categorical feature in a dataframe in R with split and map2 functions

I have been trying to create a simple function with two arguments in R: it takes a dataset and a categorical feature, and, based on that feature, stores multiple csv files, grouped by the categories of that feature, in a folder ("DATA") inside the parent working directory.
The problem I have been facing is that, simple as the function may be, once I introduced non-standard evaluation with rlang, multiple errors jump at you for the enquo parameter (either a symbol was expected, or the parameter is not a vector). Therefore, the function always fails.
The portion of code I used is the following, assuming everyone has a folder called "DATA" in the RStudio project to store the split csv files.
library(tidyverse)
library(data.table)
library(rlang)
csv_splitter <- function(df, parameter){
df <- df
# We set categorical features missing values vector, with names automatically applied with
# sapply. We introduce enquo on the parameter for non-standard evaluation.
categories <- df %>% select(where(is.character))
NA_in_categories <- sapply(categories, FUN = function(x) {sum(is.na(x))})
parameter <- enquo(c(parameter))
#We make sure such parameter is included in the set of categorical features
if (!!parameter %in% names(NA_in_categories)) {
df %>%
split(paste0(".$", !!parameter)) %>%
map2(.y = names(.), ~ fwrite(.x, paste0('./DATA/data_dfparam_', .y, '.csv')))
print("The csv's are stored now in your DATA folder")
} else {
print("your variable is not here or it is continuous, buddy, try another one")
}
}
With an error of either "arg must be a symbol" from enquo(), or of the parameter not being a vector (which in this portion of code I tried to solve with the c(parameter)), I am stuck and unable to find any change that solves it.
If anyone does have a suggestion, I'll be more than happy to try it out on my code. In any case, I'll be extremely grateful for your help!
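One way to sidestep the enquo() trouble entirely is to capture the column name with base R's substitute()/deparse() and split with base split(). This is a sketch, not the only fix (rlang::ensym() would work too); it keeps the question's assumption that a "DATA" folder exists in the working directory, and uses write.csv instead of data.table::fwrite to stay dependency-free:

```r
csv_splitter <- function(df, parameter) {
  # substitute() captures the bare column name; deparse() turns it into a string
  param_name <- deparse(substitute(parameter))
  # Identify the categorical (character) columns
  is_cat <- vapply(df, is.character, logical(1))
  # Make sure the requested parameter is one of the categorical features
  if (param_name %in% names(df)[is_cat]) {
    # base split() takes the grouping vector itself, not a string like ".$gender"
    pieces <- split(df, df[[param_name]])
    for (nm in names(pieces)) {
      write.csv(pieces[[nm]],
                file.path("DATA", paste0("data_dfparam_", nm, ".csv")),
                row.names = FALSE)
    }
    print("The csv's are stored now in your DATA folder")
  } else {
    print("your variable is not here or it is continuous, buddy, try another one")
  }
}

# csv_splitter(my_df, my_category)  # my_df / my_category are placeholders
```

The original error comes from enquo(c(parameter)): enquo() must receive a bare function argument, not a call like c(parameter), and the captured quosure then cannot be used directly with %in% or paste0().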

is it correct to select a variable which is not available in data frame? if yes, how?

I just came across a question in which a data frame variable was selected that was not part of the data frame. I marked that option as incorrect, but it turned out to be correct. Please help me understand how it works.
I cross verified code by running in R Console and it ran fine.
df <- data.frame(x = 1:10)
df %>% mutate(xy = paste(x,df$isItPossible))
In my view, the statement should throw an error, but it runs correctly. "isItPossible" is a variable that is not available in df.
When you run
df$isItPossible
it doesn't return an error, it returns NULL. This type of stuff is allowed so that you can create new columns with
df$isItPossible <- "Yes"
And the paste function doesn't have a problem with NULL values. It just ignores them.
paste("x", NULL)
# [1] "x "
But when using mutate, you really shouldn't use the df$ part. It's meant to be run as
df %>% mutate(xy = paste(x, isItPossible))
which would give you an error about the value not being found which is what you want.
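A minimal sketch of that difference (the exact error wording varies across dplyr versions, so the comment below only hedges at its content):

```r
library(dplyr)

df <- data.frame(x = 1:3)

# Inside mutate(), a bare name is looked up in the data frame first and then in
# the calling environment; a name found in neither place is a real error,
# unlike df$isItPossible, which silently returns NULL.
err <- tryCatch(
  df %>% mutate(xy = paste(x, isItPossible)),
  error = function(e) conditionMessage(e)
)
# err is an error message mentioning that 'isItPossible' was not found
```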

dplyr select executes weirdly another column

The object 'customer_profiling_vars' is a dataframe with just the variables selected by a clustering algorithm (RSKC), as seen in the R output of:
customer_profiling_vars
customer_profiling_vars$Variables
Now, I want to select from my dataset sc_df_tr_dummified only the variables listed in customer_profiling_vars$Variables, using dplyr's select:
customer_df_interprete = sc_df_tr_dummified %>%
select(customer_profiling_vars$Variables)
glimpse(customer_df_interprete)
I expect the variable 'SalePrice' to be selected.
However, some other variable ('PoolArea.576') gets selected instead, which is very weird.
Just to be sure, I tried using SalePrice directly instead of customer_profiling_vars$Variables, and it gives what I intended.
What is wrong with dplyr's select? To me, it seems to have something to do with the factor nature of 'customer_profiling_vars$Variables'.
Thanks in advance!
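The guess about the factor looks right as far as I can tell: a factor handed to select() is not treated as a vector of column names, so the selection can fall through to the factor's underlying integer codes (i.e. column positions). Converting it to character first, and wrapping it in all_of() in current dplyr, makes the intent explicit. A sketch with made-up columns:

```r
library(dplyr)

df <- data.frame(PoolArea.576 = 1, SalePrice = 2)
vars <- factor("SalePrice")

# as.character() makes select() match by name rather than by the factor's
# integer codes; all_of() additionally errors if a requested column is missing.
out <- df %>% select(all_of(as.character(vars)))
names(out)  # "SalePrice"
```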

R filter function won't filter out results

I'm trying to find all rows where values exist between a top and bottom depth value in Azure ML. I'm using dplyr's filter function, and the code doesn't throw an error. But when I look at the results it hasn't filtered anything. Can somebody see where I'm going wrong?
library(dplyr)
Upper_Depth<-dataset1$Upper_Depth
Lower_Depth<-dataset1$Lower_Depth
TopMD<-dataset1$TopMD
BaseMD<-dataset1$BaseMD
# Filter where the Perf is within the Upper and Lower Depth intervals:
#select(Upper_Depth, Lower_Depth, TopMD, BaseMD) %>%
filter(dataset1, Upper_Depth > TopMD & Lower_Depth < BaseMD);
# Subset the data where the perfs are in the L_WSEC_A:
#subset(dataset1, Upper_Depth > TopMD & Lower_Depth < BaseMD)
full_data <- dataset1
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("full_data")
I tried the subset function but get the same result. I apologize because I'm very new to R and Azure ML Studio.
Welcome to R!
The reason your code is not working as expected is that the first four lines in your script assign vectors. This would work fine if you were using base R subsetting (try ?'[' at the console) and performing logical tests on the columns as vectors.
dplyr works somewhat differently. The metaphor is closer to SQL, treating each "column" in the dataset as a SQL field. So, you can work with your variables directly, without subsetting them out into vectors.
Try:
dataset1 %>% filter(Upper_Depth > TopMD & Lower_Depth < BaseMD)
It should give you your dataset, filtered so that each condition evaluates as TRUE.
If you want to save the results, just left-assign them like this:
full_data <- dataset1 %>% filter(Upper_Depth > TopMD & Lower_Depth < BaseMD)
I hope this helps!
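A tiny reproducible sketch of the pattern, with made-up depth values (only the first row satisfies both conditions):

```r
library(dplyr)

dataset1 <- data.frame(
  Upper_Depth = c(10, 30),
  Lower_Depth = c(20, 40),
  TopMD       = c(5, 35),
  BaseMD      = c(25, 38)
)

# filter() keeps rows where every condition is TRUE; row 2 fails Upper_Depth > TopMD
full_data <- dataset1 %>% filter(Upper_Depth > TopMD & Lower_Depth < BaseMD)
nrow(full_data)  # 1
```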

Problems using dplyr in a function (group_by)

I want to use dplyr for some data manipulation. Background: I have a survey weight and a bunch of variables (mostly Likert items). I want to sum the frequencies and percentages per category, with and without the survey weight.
As an example, let us just use frequencies for the gender variable. The result should be this:
gender freq freq.weighted
1 292 922.2906
2 279 964.7551
9 6 21.7338
I will do this for many variables, so I decided to put the dplyr code inside a function so that I only have to change the variable and type less.
#exampledata
gender<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
survey_weight<-c("2.368456","2.642901","2.926698","3.628653","3.247463","3.698195","2.776772","2.972387","2.686365","2.441820","3.494899","3.133106","3.253514","3.138839","3.430597","3.769577","3.367952","2.265350","2.686365","3.189538","3.029999","3.024567","2.972387","2.730978","4.074495","2.921552","3.769577","2.730978","3.247463","3.230097")
test_dataframe<-data.frame(gender,survey_weight)
#function
weighting.function<-function(dataframe,variable){
test_weighted<- dataframe %>%
group_by_(variable) %>%
summarise_(interp(freq=count(~weight)),
interp(freq_weighted=sum(~weight)))
return(test_weighted)
}
result_dataframe<-weighting.function(test_dataframe,"gender")
#this second step was left out in this example:
#mutate_(perc=interp(~freq/sum(~freq)*100),perc_weighted=interp(~freq_weighted/sum(~freq_weighted)*100))
This leads to the following Error-Message:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "formula"
I have tried a lot of different things. First, I used freq=n() to count the frequencies, but I always got an error (I checked that plyr was loaded before dplyr and not afterwards; it also didn't work).
Any ideas? I read the vignette on standard evaluation, but I always run into problems and have no idea what a solution could be.
I think you have a few nested mistakes which are causing you problems. The biggest one is using count() instead of summarise(). I'm guessing you wanted n():
weighting.function <- function(dataframe, variable){
dataframe %>%
group_by_(variable) %>%
summarise_(
freq = ~n(),
freq_weighted = ~sum(survey_weight)
)
}
weighting.function(test_dataframe, ~gender)
You also had a few unneeded uses of interp(). If you do use interp(), the call should look like freq = interp(~n()), i.e. the name is outside the call to interp, and the thing being interpolated starts with ~.
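As a side note, the underscore verbs (group_by_(), summarise_()) are deprecated in current dplyr; the same function can now be written with the {{ }} (curly-curly) operator and no lazyeval at all. A sketch against the question's example data; note that survey_weight is stored as character there, so it needs as.numeric():

```r
library(dplyr)

weighting.function <- function(dataframe, variable) {
  dataframe %>%
    group_by({{ variable }}) %>%          # {{ }} forwards the bare column name
    summarise(
      freq          = n(),
      freq_weighted = sum(as.numeric(survey_weight))
    )
}

# weighting.function(test_dataframe, gender)  # with the question's test_dataframe
```

With this interface the variable is passed unquoted, as in weighting.function(test_dataframe, gender), rather than as the formula ~gender required by the old SE verbs.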