The object 'customer_profiling_vars' is a dataframe containing only the variables selected by a clustering algorithm (RSKC), as seen in the R output below the R code:
customer_profiling_vars
customer_profiling_vars$Variables
Now I want to keep, in my dataset sc_df_tr_dummified, only the variables listed in the above vector from the dataframe 'customer_profiling_vars', using dplyr's 'select':
customer_df_interprete = sc_df_tr_dummified %>%
select(customer_profiling_vars$Variables)
glimpse(customer_df_interprete)
I expect to get the variable 'SalePrice' selected.
However, a different variable ('PoolArea.576') gets selected instead, which is very weird:
Just to be sure, I tried using SalePrice directly instead of customer_profiling_vars$Variables, and that gives what I intended:
What is wrong with dplyr's select? To me, it seems to have something to do with the factor nature of 'customer_profiling_vars$Variables':
Thanks in advance!
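The factor suspicion can be checked with a small sketch. It assumes, as the question suggests but does not confirm, that 'Variables' is a factor; the tiny data frames below are hypothetical stand-ins for sc_df_tr_dummified and customer_profiling_vars. A factor is stored as integer codes, so if those codes end up being used as column positions, code 1 points at the first column rather than at 'SalePrice'. Converting to character before selecting avoids that:

```r
library(dplyr)

# Hypothetical stand-ins for the objects in the question
sc_df <- data.frame(
  PoolArea.576 = c(0, 1),
  LotArea      = c(8450, 9600),
  SalePrice    = c(208500, 181500)
)
vars_df <- data.frame(Variables = factor("SalePrice"))

# A factor is stored as integer codes; the single level has code 1
code <- as.integer(vars_df$Variables)

# Used as a column position, that code points at the first column
by_position <- names(sc_df)[code]

# Selecting by name: convert the factor to character first
selected <- sc_df %>%
  select(all_of(as.character(vars_df$Variables)))

glimpse(selected)
```

Here by_position is "PoolArea.576", which matches the symptom described, while selected contains only SalePrice. Whether this is what happens in the original session depends on the dplyr version in use.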
Related
I'm using dplyr and Spark to create a new variable with the mutate command. This new variable, new_variable, is categorical and must be ALFA if the value of the variable my_data_variable appears in a column of another dataframe, other_df$one_column. Otherwise, if the value is not included in the values of other_df$one_column, it must be BETA.
An example of what I did:
my_data %>%
mutate(new_variable = ifelse(my_data_variable == other_df$one_column, "ALFA","BETA"))
Unfortunately, I get the error below. Using !!other_df$one_column or local(other_df[['one_column']]) instead of other_df$one_column does not work either.
Error: Cannot embed a data frame in a SQL query.
If you are seeing this error in code that used to work, the most likely cause is a change dbplyr 1.4.0. Previously `df$x` or
`df[[y]]` implied that `df` was a local variable, but now you must make that explict with `!!` or `local()`, e.g., `!!df$x` or
`local(df[["y"]))
Are there alternative methods to the ifelse function to get the expected result?
Thanks to @RonakShah for his help. The solution is the following, using %in% instead of == together with !!:
my_data %>%
mutate(new_variable = ifelse(my_data_variable %in% !!other_df$one_column, "ALFA","BETA"))
I was wondering if it is possible to use the following data.table feature without providing column names:
dt <- data.table(mtcars)[,.(mpg, cyl)]
dt[,`:=`(avg=mean(mpg), med=median(mpg))]
Let's say, for example, that I have a function that returns more than one column, like this:
mfunc <- function(x) { cbind(x^2, x^3) }
But if I want to assign its result as new columns in that specific way, R executes the function mfunc twice, which is not efficient.
dt[,`:=`(sqr=mfunc(mpg)[,1], cub=mfunc(mpg)[,2])]
So, without workarounds, is it possible to do something similar to this:
dt[,`:=`(mfunc(mpg))] #this returns an error
dt[,`:=`(error2=mfunc(mpg))] #this returns an error
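One possible approach, offered as a sketch rather than a definitive answer: `:=` accepts a character vector of new column names on the left and a list of columns on the right, so the matrix returned by the function can be converted once with as.data.table and assigned in a single call. This still requires naming the new columns (the names "sqr" and "cub" are taken from the question), but the function runs only once:

```r
library(data.table)

# Function from the question: returns a two-column matrix
mfunc <- function(x) cbind(x^2, x^3)

dt <- data.table(mtcars)[, .(mpg, cyl)]

# mfunc() is evaluated once; as.data.table() turns the matrix into a
# list of columns, which := assigns to the names on the left-hand side
dt[, c("sqr", "cub") := as.data.table(mfunc(mpg))]

head(dt)
```

An alternative would be to have the function return a named list instead of a matrix, which avoids the conversion step.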
I am trying to create a variable that contains "buckets" of a numeric value in another column. For example:
nts$size_bucket <- cut(nts$loansize,
                       c(0, 5000, 10000, 25000, 50000, 100000, 200000, 300000, 500000, Inf),
                       c('<$5K', '5-10K', '10-25K', '25-50K', '50-100K', '100-200K', '200-300K', '300-500K', '500K+'))
In base R, cut would work perfectly, but it doesn't appear to work with a SparkR dataframe and gives the exception:
'x' must be numeric
even though x is numeric.
Any suggestions for how to accomplish this in SparkR?
Thanks!
I am trying to make a scatter-plot matrix from a dataframe (available at http://statweb.stanford.edu/~tibs/ElemStatLearn/). However, the order of the variables is not the one I want, and I would like to leave out the variable train.
Dataframe order:
lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa,train
The order I want:
lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45
For the moment, here is my code:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate=as.data.frame.matrix(prostate1)
pairs(prostate, col="purple")
I tried to add the arguments horInd and verInd, but I get the following warnings:
1: "horInd" is not a graphical parameter
2: "verInd" is not a graphical parameter
If anyone could help me, it would really be appreciated.
Try this:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate = as.matrix(prostate1)
prostate.reordered = prostate[, c("lpsa", "lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")]
pairs(prostate.reordered, col="purple")
The idea is to select the columns you want, in the order you want, using the column names for selection.
Of course, it would probably be even more efficient not to convert the whole data frame into a matrix, but only the required columns...
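A minimal sketch of that last idea, subsetting the data frame directly by column name with no matrix conversion at all. Since the file path in the question is elided, the data frame below is a hypothetical stand-in filled with random numbers; only the column names come from the question:

```r
# Hypothetical stand-in for the prostate data: random values,
# but with the column names and order listed in the question
set.seed(1)
cols_in_file <- c("lcavol", "lweight", "age", "lbph", "svi",
                  "lcp", "gleason", "pgg45", "lpsa", "train")
prostate <- as.data.frame(
  matrix(rnorm(10 * length(cols_in_file)),
         ncol = length(cols_in_file),
         dimnames = list(NULL, cols_in_file))
)

# Desired order, with 'train' left out
wanted <- c("lpsa", "lcavol", "lweight", "age", "lbph",
            "svi", "lcp", "gleason", "pgg45")

# Indexing a data frame by a character vector selects and reorders columns
prostate.reordered <- prostate[, wanted]

pairs(prostate.reordered, col = "purple")
```

pairs accepts a data frame as well as a matrix, so the as.matrix step is not needed for plotting.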
I am trying to use a custom function inside 'ddply' in order to create a new variable (NormViability) in my data frame, based on values of a pre-existing variable (CelltiterGLO).
The function is meant to create a rescaled (%) value of 'CelltiterGLO' based on the mean 'CelltiterGLO' values at a specific sub-level of the variable 'Concentration_nM' (0.01).
So if the mean of 'CelltiterGLO' at 'Concentration_nM'==0.01 is set as 100, I want to rescale all other values of 'CelltiterGLO' over the levels of other variables ('CTSC', 'Time_h' and 'ExpType').
The normalization function is the following:
normalize.fun = function(CelltiterGLO) {
  idx = Concentration_nM==0.01
  jnk = mean(CelltiterGLO[idx], na.rm = T)
  out = 100*(CelltiterGLO/jnk)
  return(out)
}
and this is the code I try to apply to my dataframe:
library("plyr")
df.bis=ddply(df,
             .(CTSC, Time_h, ExpType),
             transform,
             NormViability = normalize.fun(CelltiterGLO))
The code runs, but when I double check (with aggregate or tapply) whether the mean of 'NormViability' equals 100 at 'Concentration_nM'==0.01, I do not get 100, but different numbers. However, if I subset my df by the two levels of the variable 'ExpType', the code returns the correct numbers on each separate subset. I tried making 'ExpType' either character or factor, but I got similar results. 'ExpType' has two levels/values, "Combinations" and "DoseResponse". I can't figure out why the code is not working on the entire df. I wonder if this is because the two levels of 'ExpType' do not contain the same number of levels for all the other variables, e.g. one of the levels of 'Time_h' is missing for the level "Combinations" of 'ExpType'.
Thanks very much for your help, and I apologize in advance if the answer is already on Stack Overflow and I was not able to find it.
Michele
I (the OP) found out that the function used a variable in its body that was missing from its arguments. Simply adding the variable Concentration_nM as an argument of the custom function, and passing it in the ddply call, solved the problem.
THANKS
m.
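For reference, a sketch of what the corrected version might look like. The original data are not shown, so the data frame below is simulated and purely hypothetical (including the 'rep_id' column); only the variable names come from the question. With Concentration_nM passed as an argument, the index is computed on each group's own rows. In the original version the name had to be found outside the function, which is presumably why the group means were off:

```r
library(plyr)

# Corrected function: Concentration_nM is now an explicit argument
normalize.fun <- function(CelltiterGLO, Concentration_nM) {
  idx <- Concentration_nM == 0.01
  ref <- mean(CelltiterGLO[idx], na.rm = TRUE)
  100 * (CelltiterGLO / ref)
}

# Hypothetical simulated data using the variable names from the question
set.seed(42)
df <- expand.grid(
  CTSC = c("A", "B"),
  Time_h = c(24, 48),
  ExpType = c("Combinations", "DoseResponse"),
  Concentration_nM = c(0.01, 0.1, 1),
  rep_id = 1:2,
  stringsAsFactors = FALSE
)
df$CelltiterGLO <- runif(nrow(df), min = 1000, max = 5000)

# transform() evaluates both arguments within each group's subset
df.bis <- ddply(df,
                .(CTSC, Time_h, ExpType),
                transform,
                NormViability = normalize.fun(CelltiterGLO, Concentration_nM))

# Check: mean NormViability at the reference concentration, per group
check <- aggregate(NormViability ~ CTSC + Time_h + ExpType,
                   data = subset(df.bis, Concentration_nM == 0.01),
                   FUN = mean)
print(check)
```

The check table should show a mean of 100 for every combination of CTSC, Time_h and ExpType.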