Why is dcast not working in reshape2?

The paper published for the reshape package (Wickham 2007) gives this example:
library(reshape2)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
dcast(ffm, variable ~ ., c(min, max))
Similarly, this doesn't work in reshape2 but appears to work in Wickham (2007):
dcast(ffm, variable ~ ., summary)
However, the cast function gives an error. How can I get the function to work?

The paper is for the reshape package, not the reshape2 package. You have also not reproduced the example as it was written. It should be:
library("reshape") # not explicit in the paper, but implied since it is for the reshape pacakge
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
cast(ffm, treatment ~ rep, c(min, max))
Note that the function call is cast, not dcast. That change was one of the major differences between the two packages. Another was dropping support for multiple aggregation functions during reshaping, since that was considered better handled by the plyr package. If you use the reshape package (which is still available from CRAN), the examples work.
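If you want to stay with reshape2, a minimal sketch of two workarounds (assuming the french_fries data that ships with reshape2): run one aggregation function per dcast call, or compute several summaries in a single pass with plyr, which is what the package author intended:
library(reshape2)
library(plyr)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
# reshape2: one aggregation function per cast
mins <- dcast(ffm, treatment ~ rep, min)
maxs <- dcast(ffm, treatment ~ rep, max)
# plyr: min and max together in one pass
ddply(ffm, .(treatment, rep), summarise, min = min(value), max = max(value))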

Related

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here who has been wrestling with making code more efficient for future use, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and passed it into sapply:
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions that I have been completely overlooking, instead of manually creating lists. Does anyone have any suggestions?
This can work using the iris dataset as data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
          Sepal.Length                     Sepal.Width
statistic 96.93744                         63.57115
parameter 2                                2
p.value   8.918734e-22                     1.569282e-14
method    "Kruskal-Wallis rank sum test"   "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect a more standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions that expect a single vector as input, as opposed to functions that have a data argument, in which case you use variable rather than data$variable.
I have a few sidebars before I give some advice.
Sidebar 1 - S3 Methods
While this may be beside the point with regard to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument to the function. In this example you are passing a formula expression, for which the data argument is necessary, whereas the default method just requires x and g arguments (with which you could probably have used your original pipeline).
So, if you're used to doing something one way, be sure to check the documentation to see whether a function has other dispatch methods that will work for you.
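For illustration, a minimal sketch using the built-in iris data; the same test can be reached through either method:
# formula method: dispatched because the first argument is a formula; data is required
kruskal.test(Sepal.Length ~ Species, data = iris)
# default method: takes the response vector x and the grouping vector g directly
kruskal.test(iris$Sepal.Length, g = iris$Species)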
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning - there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit:
vars <- list(data$age, data$gender, data$PCLR) # the vectors themselves, not strings like "data$age"
means <- sapply(vars, fmean, g = data$group, na.rm = TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that they are evaluated within a certain data.frame object:
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], g = data[[id]], na.rm = TRUE), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the packages dplyr, tidyr, and purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to manually make a list before doing the sapply. With dplyr you can possibly circumvent this if there is a condition to select on:
data %>%
  group_by(group) %>% # group the data
  summarise_if(is.numeric, mean, na.rm = TRUE) # applies mean to every column that is a numeric vector
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
  group_by(group) %>% # grouping to retain the column in the next select statement
  select_if(is.numeric) %>% # select all numeric columns
  pivot_longer(cols = -group) %>% # reshape all columns except "group"; names go in `name`, values in `value`
  group_by(name) %>% # regroup on the name variable (the old numeric columns)
  summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) # perform the test per variable
I've only mentioned purrr because you can almost drop-in replace all apply-style functions with their map variants. purrr is very consistent across its function variants, with a lot of options to control the output type.
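For example, a minimal sketch of the iris version from above rewritten with purrr; map_dbl() guarantees a plain numeric vector as output:
library(purrr)
vars <- c("Sepal.Length", "Sepal.Width")
# named numeric vector of p-values, one per variable
p_values <- map_dbl(set_names(vars), ~ kruskal.test(iris[[.x]] ~ iris$Species)$p.value)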
I hope this helps and wish you luck on your coding adventures.

R - Fitting a model per subject using data.table or dplyr

I have a set of observations for many subjects and I would like to fit a model for each subject.
I"m using the packages data.table and fitdistrplus, but could also try to use dlpyr.
Say my data are of this form:
subject_id  observation
1           35
1           38
2           44
2           49
Here's what I've tried so far:
subject_models <- dt[, fitdist(observation, "norm", method = "mme"), by = subject_id]
This causes an error, I think because the call to fitdist returns a fitdist object, which cannot be stored in a data.table/data.frame.
Is there any intuitive way to do this using data.table or dplyr?
EDIT: A dplyr answer was provided, but I would appreciate a data.table one as well; I'll try to run some benchmarks against the two.
This can be easily achieved with the purrr package.
I assume it's the same thing @alistaire suggested:
library(purrr)
library(dplyr)
library(fitdistrplus)
dt %>% split(dt$subject_id) %>% map( ~ fitdist(.$observation, "norm", method = "mme"))
Alternatively, without purrr,
dt %>% split(dt$subject_id) %>% lapply(function(x) fitdist(x$observation, "norm", method = "mme"))
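Since the edit asks for a data.table version too, here is a minimal sketch: wrapping the fitdist object in list() stores one model per subject in a list column, which sidesteps the original error.
library(data.table)
library(fitdistrplus)
# one row per subject; the model column holds the fitted objects
subject_models <- dt[, .(model = list(fitdist(observation, "norm", method = "mme"))), by = subject_id]
subject_models$model[[1]] # inspect the fit for the first subject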

dplyr: colSums on sub-grouped (group_by) data frames: elegantly

I have a very large data frame (265,874 x 30) with three sensible grouping variables: an age category (1-6), dates (5,479 of them), and geographic locality (4 total). Each record consists of a choice from each of these, plus 27 count variables. I want to group by each of the grouping variables, then take colSums on the resulting sub-grouped 27 variables. I've been trying to use dplyr (v0.2) to do it, because doing it manually ends up setting up a lot of redundant things (or resorting to a loop for iterating across the grouping options, for lack of an elegant solution).
Example code:
countData <- sample(0:10, 2000, replace = TRUE)
dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
locality <- sample(1:2, 200, replace = TRUE) # 200, to match the 200 rows built below
ageCat <- sample(1:2, 200, replace = TRUE)
sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))
then what I'd like to do is ...
library("dplyr")
sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)]))
but this doesn't quite work, as the results from colSums() aren't data frames. If I cast it, it works:
sampleDF %.% group_by(locality, ageCat, dates) %.% do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
but the final do(...) bit seems very clunky.
Any thoughts on how to do this more elegantly or effectively? I guess the question comes down to: how best to use the do() function and the . operator to summarize a data frame via colSums.
Note: the do(.) operator only applies to dplyr 0.2, so you need to grab it from GitHub (link), not from CRAN.
Edit: results from suggestions
Three solutions:
My suggestion in the post: 146.765 seconds elapsed.
@joran's suggestion below: 6.902 seconds.
@eddi's suggestion in the comments, using data.table: 6.715 seconds.
I didn't bother to replicate, just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform approximately the same on my data set, and both are significantly faster when used properly than the hack solution I came up with yesterday.
Unless I'm missing something, this seems like a job for summarise_each (a sort of colwise analogue from plyr):
sampleDF %.% group_by(locality, ageCat, dates) %.% summarise_each(funs(sum))
The grouping columns are not included in the summarizing function by default, and you can apply the functions to only a subset of columns using the same technique as with select.
(summarise_each is in version 0.2 of dplyr but not in 0.1.3, as far as I know.)
The method summarise_each mentioned in joran's answer from 2014 has been deprecated.
Instead, please use summarize_all() or summarize_at().
The methods summarize_all and summarize_at mentioned in Hack-R's answer from 2018 have been superseded.
Instead, please use summarize()/summarise() combined with across().
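A minimal sketch of the modern equivalent, assuming the sampleDF from the question and dplyr >= 1.0:
library(dplyr)
sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise(across(everything(), sum), .groups = "drop") # sum every non-grouping column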

cast {reshape}: using variables instead of the columns' name

Say I have a dataframe:
data <- data.frame(id = c(1, 2, 2, 2),
                   code = c("A", "B", "A", "B"),
                   area = c(23.1, 56.0, 45.8, 78.5))
and this line of code that's working fine:
df <- cast(data, id ~ code, fun.aggregate = sum)
Then I create the following variables:
ID <- "id"
CODE <- "code"
and use the variables as arguments in the cast function:
df <- cast(data, ID ~ CODE, fun.aggregate = sum)
Then I get the following error:
Error: Casting formula contains variables not found in molten data: ID, CODE
How can I use variables instead of the columns' name with the cast function?
You need to construct a formula:
cast(data, as.formula(paste(ID, CODE, sep = "~")), fun.aggregate = sum)
However, package reshape has been superseded by package reshape2 (have a look at its function dcast and see @Ananda Mahto's comment). The reshape function in base R might also be of interest to you.
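A minimal sketch of the same idea with reshape2's dcast; value.var is spelled out here as an assumption (dcast would otherwise guess the measure column):
library(reshape2)
dcast(data, as.formula(paste(ID, CODE, sep = "~")), fun.aggregate = sum, value.var = "area")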
You can use do.call() as well:
do.call("cast", args = list(data = data, formula = paste(ID, '~', Code), fun.aggregate = "sum"))
Arguments can (but don't have to) be delivered as strings.

ddply + summarise function column name input

I am trying to use ddply and summarise together from the plyr package, but am having difficulty parsing through column names that keep changing... In my example I would like something that parses in X1 programmatically rather than hard-coding X1 into the ddply function.
Setting up an example:
require(xts)
require(plyr)
require(reshape2)
require(lubridate)
t <- xts(matrix(rnorm(10000), ncol = 10), Sys.Date() - 1000:1)
t.df <- data.frame(coredata(t))
t.df <- cbind(day = wday(index(t), label = TRUE, abbr = TRUE), t.df)
t.df.l <- melt(t.df, id.vars = c("day", colnames(t.df)[2]), measure.vars = colnames(t.df)[3:ncol(t.df)])
This is the bit I am struggling with:
cor.vars <- ddply(t.df.l, c("day","variable"), summarise, cor(X1, value))
I do not want to use the term X1 and would like to use something like:
cor.vars <- ddply(t.df.l, c("day","variable"), summarise, cor(colnames(t.df)[2], value))
but that comes up with the error: Error in cor(colnames(t.df)[2], value) : 'x' must be numeric
I also tried various other combos that parse in the vector values for the x argument of cor... but for some reason none of them seem to work.
Any ideas?
Although this is probably not the intended usage for summarize and there must be much better approaches to your problem, the direct answer to your question is to use get:
ddply(t.df.l, c("day","variable"), summarise, cor(get(colnames(t.df)[2]), value))
Edit: here is, for example, one approach that is in my opinion better suited to your problem:
ddply(t.df.l, c("day", "variable"), function(x) cor(x["X1"], x["value"]))
Above, "X1" can be also replaced by 2 or the name of a variable holding "X1", etc. It depends how you want to programmatically access the column.
