I am trying to use ddply and summarise together from the plyr package but am having difficulty parsing through column names that keep changing...In my example i would like something that would parse in X1 programatically rather than hard coding in X1 into the ddply function.
setting up an example
require(xts)
require(plyr)
require(reshape2)
require(lubridate)
t <- xts(matrix(rnorm(10000),ncol=10), Sys.Date()-1000:1)
t.df <- data.frame(coredata(t))
t.df <- cbind(day=wday(index(t), label=TRUE, abbr=TRUE), t.df)
t.df.l <- melt(t.df, id.vars=c("day",colnames(t.df)[2]), measure.vars=colnames(t.df)[3:ncol(t.df)])
This is the bit im am struggling with....
cor.vars <- ddply(t.df.l, c("day","variable"), summarise, cor(X1, value))
i do not want to use the term X1 and would like to use something like
cor.vars <- ddply(t.df.l, c("day","variable"), summarise, cor(colnames(t.df)[2], value))
but that comes up with the error: Error in cor(colnames(t.df)[2], value) : 'x' must be numeric
I also tried various other combos that parse in the vector values for the x argument in cor...but for some reason none of them seem to work...
any ideas?
Although this is probably not the intended usage for summarize and there must be much better approaches to your problem, the direct answer to your question is to use get:
ddply(t.df.l, c("day","variable"), summarise, cor(get(colnames(t.df)[2]), value))
Edit: here is for example one approach that is in my opinion better suited to your problem:
ddply(t.df.l, c("day", "variable"), function(x)cor(x["X1"], x["value"]))
Above, "X1" can be also replaced by 2 or the name of a variable holding "X1", etc. It depends how you want to programmatically access the column.
Related
How could I refer to a variable new in a function without writing them as string or indexing before. I want to construct a function where I can replace grouping variables easily.
For example:
final_table %>% data.table::dcast(Lipids ~ combi_new)
Another time this factor variable could be named differently.
e.g., group_2
It doesn't work with final_table[,2] - how could I solve this?
Thanks,
Nadine
I'm not sure I understand the question, are you trying to build the formula for a dcast from the column indices?
Maybe that answers your problem:
library(data.table)
setDT(final_table)
colnames(final_table)
measurevar <- colnames(final_table)[2]
groupvar <- colnames(final_table)[1]
valvar <- colnames(final_table)[3]
formula <- as.formula(paste(measurevar, groupvar, sep=" ~ "))
test <- dcast(final_table, formula, value.var = valvar)
Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions I have been completely overlooking, instead manually creating lists. Anyone have any suggestions?
This can work using iris dataset as data similar to great suggestion from #deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
Sepal.Length Sepal.Width
statistic 96.93744 63.57115
parameter 2 2
p.value 8.918734e-22 1.569282e-14
method "Kruskal-Wallis rank sum test" "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect a more standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions who just expect a single vector as impute as opposed to functions who have a data argument, in which you use variable as opposed to data$variable.
I have a few sidebars before I give some advice
Sidebar 1 - S3 Methods
While this may be besides the point in regards to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument in the function. In this example you are passing a formula expression in which the data argument is necessary, whereas the default method just requires x and g arguments (which you could probably use your original pipelines).
So, if you're used to doing something one way, be sure to check the documentation of a function has different dispatch methods that will work for you.
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data), is that in the first the user explicitly tells R where to find the vector, while in the latter, it is up the the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning - there are many ways to do the same thing. So its generally up to you what you want your standard to be.
If you prefer to be explicit
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions in which they are expected to be evaluated within a certain data.frame object
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], id = data[[id]], na.rm=T), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages dplyr, tidyr, purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you have expressed having to manually make list before doing the sapply. In dplyr package you can possibly circumvent this if there is a condition to filter on.
data %>%
group_by(group) %>% #groups data
summarise_if(is.numeric, mean, na.rm = TRUE) #applys mean to every column that is a numeric vector
And similarly, we can summarise the results of the kurskal.test function if we reshape the data a bit.
data %>%
group_by(group) %>% #grouping to retain column in the next select statement
select_if(is.numeric) %>% # selecting all numeric columns
pivot_longer(cols = -group) %>% # all columns except "group" will be reshaped. Column names are stored in `name`, and values are stored in `values`
group_by(name) %>% #regroup on name variable (old numeric columns)
summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) #perform test
I've only mentioned purrr because you can almost drop and replace all apply style functions with their map variants. purrr is very consistent across its function variants with a lot of options to control the output type.
I hope this helps and wish you luck on your coding adventures.
I'm trying to use the ddply-summarise function (e.g. mean()) within a custom function. However, instead of resulting in the means for each group, it results in a dataframe showing the mean of all observations.
Many thanks already in advance for your help!
library(plyr)
library(dplyr)
df <- data.frame(Titanic)
colnames(df)
# ddply-summarise - Outside of function
df.OutsideOfFunction <- ddply(df, c("Class","Sex"), summarise,
Mean=mean(Freq))
# new function
newFunction <- function(data, GroupVariables, ColA){
mean(data[[ColA]])
plyr::ddply(data, GroupVariables, summarise,
Mean=mean(data[[ColA]]))
}
#ddply-summarise - InsideOfFunction
df.InsideOfFunction <- newFunction(data=df,
GroupVariables=c("Class","Sex"),
ColA ="Freq")
It should work this way, by converting ColA input first to symbol and then evaluating it:
# new function
newFunction <- function(data, GroupVariables, ColA){
#mean(data[[ColA]])
plyr::ddply(data, GroupVariables, summarise, Mean=mean(UQ(sym(ColA))))
}
Please take a look also in this post as to why this happens. It's the first time i've seen it myself so i am not the best one to explain it - it looks like it depends on the way summarize and/or other plyr or dplyr functions accept parameters as input (with/without quote) and how these are evaluated.
Also since you are loading dplyr as well, you can stick to one package if you like and write your function like this:
newFunction <- function(data, GroupVariables, ColA){
data %>% group_by(.dots=GroupVariables) %>% summarise(Mean=mean(UQ(sym(ColA))))
}
Hope this helps
Really struggling with putting dplyr functions within my functions. I understand the function_ suffix for the standard evaluation versions, but still having problems, and seemingly tried all combinations of eval paste and lazy.
Trying to divide multiple columns by the median of the control for a group. Example data includes an additional column in iris named 'Control', so each species has 40 'normal', and 10 'control'.
data(iris)
control <- rep(c(rep("normal", 40), rep("control", 10)), 3)
iris$Control <- control
Normal dplyr works fine:
out_df <- iris %>%
group_by(Species) %>%
mutate_each(funs(./median(.[Control == "control"])), 1:4)
Trying to wrap this up into a function:
norm_iris <- function(df, control_col, control_val, species, num_cols = 1:4){
out <- df %>%
group_by_(species) %>%
mutate_each_(funs(./median(.[control_col == control])), num_cols)
return(out)
}
norm_iris(iris, control_col = "Control", control_val = "control", species = "Species")
I get the error:
Error in UseMethod("as.lazy_dots") :
no applicable method for 'as.lazy_dots' applied to an object of class "c('integer', 'numeric')"
Using funs_ instead of funs I get Error:...: need numeric data
If you haven't already, it might help you to read the vignette on standard evaluation here, although it sounds like some of this may be changing soon.
Your function is missing the use of interp from package lazyeval in the mutate_each_ line. Because you are trying to use a variable name (the Control variable) in the funs, you need funs_ in this situation along with interp. Notice that this is a situation where you don't need mutate_each_ at all. You would need it if you were trying to use column names instead of column numbers when selecting the columns you want to mutate.
Here is what the line would look like in your function instead of what you have:
mutate_each(funs_(interp(~./median(.[x == control_val]), x = as.name(control_col))),
num_cols)
I'm using tidyr together with shiny and hence needs to utilize dynamic values in tidyr operations.
However I do have trouble using the gather_(), which I think was designed for such case.
Minimal example below:
library(tidyr)
df <- data.frame(name=letters[1:5],v1=1:5,v2=10:14,v3=7:11,stringsAsFactors=FALSE)
#works fine
df %>% gather(Measure,Qty,v1:v3)
dyn_1 <- 'Measure'
dyn_2 <- 'Qty'
dyn_err <- 'v1:v3'
dyn_err_1 <- 'v1'
dyn_err_2 <- 'v2'
#error
df %>% gather_(dyn_1,dyn_2,dyn_err)
#error
df %>% gather_(dyn_1,dyn_2,dyn_err_1:dyn_err_2)
after some debug I realized the error happened at melt measure.vars part, but I don't know how to get it work with the ':' there...
Please help with a solution and explain a little bit so I could learn more.
You are telling gather_ to look for the colume 'v1:v3' not on the separate column ids. Simply change dyn_err <- "v1:v3" to dyn_err <- paste("v", seq(3), sep="").
If you df has different column names (e.g. var_a, qtr_b, stg_c), you can either extract those column names or use the paste function for whichever variables are of interest.
dyn_err <- colnames(df)[2:4]
or
dyn_err <- paste(c("var", "qtr", "stg"), letters[1:3], sep="_")
You need to look at what column names you want and make the corresponding vector.