I have a set of observations for many subjects and I would like to fit a model for each subject.
I'm using the packages data.table and fitdistrplus, but could also try dplyr.
Say my data are of this form:
#subject_id #observation
1 35
1 38
2 44
2 49
Here's what I've tried so far:
subject_models <- dt[,fitdist(observation, "norm", method = "mme"), by=subject_id]
This causes an error, I think because the call to fitdist returns a fitdist object, which I believe cannot be stored in a data.table or data frame.
Is there any intuitive way to do this using data.table or dplyr?
EDIT: A dplyr answer was provided, but I would appreciate a data.table one as well, I'll try to run some benchmarks against the two.
This can be easily achieved with the purrr package.
I assume it's the same thing @alistaire suggested:
library(purrr)
library(dplyr)
library(fitdistrplus)
dt %>% split(dt$subject_id) %>% map( ~ fitdist(.$observation, "norm", method = "mme"))
Alternatively, without purrr,
dt %>% split(dt$subject_id) %>% lapply(., function(x) fitdist(x$observation, "norm", method = "mme"))
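For the data.table version requested in the edit, one possible sketch is to wrap each fit in list() so that it lands in a list column. The toy dt below is rebuilt from the sample rows in the question, and the column name model is only an illustrative choice, not something from the original post.

```r
library(data.table)
library(fitdistrplus)

# Toy data rebuilt from the sample rows in the question
dt <- data.table(subject_id = c(1, 1, 2, 2),
                 observation = c(35, 38, 44, 49))

# Wrapping the fit in list() stores each fitdist object as one cell of a list column
subject_models <- dt[, .(model = list(fitdist(observation, "norm", method = "mme"))),
                     by = subject_id]

# Pull the fitted parameters back out as ordinary columns (mean, sd)
estimates <- subject_models[, as.list(model[[1]]$estimate), by = subject_id]
```

Individual fits can then be inspected with subject_models$model[[1]], and estimates holds one row of parameters per subject.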
I am wondering if there is a package or fast way to generate a statistical summary table for the result of clustering.
I imagine I can choose variables of interest and group by cluster number and then calculate mean and max and etc. I am looking for a fast way to do it. Is there any package I can use?
Thanks
The fastest and easiest way might depend on the exact results you want. The easiest approach is probably summary() in base R; a more versatile one is the dplyr package with its functions group_by() and summarize(). For specific types of data, other packages may provide a more practical summary.
An example:
DF <- data.frame(groups = sample(LETTERS, 20, replace = TRUE),
var = runif(20))
summary(DF)
library(dplyr)
DF %>%
group_by(groups) %>%
summarize(mean_by_group = mean(var),
number = n())
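Applied to an actual clustering result, the same pattern might look like the sketch below. The use of kmeans() on the built-in iris data is only an assumed stand-in for the asker's clustering, and across() requires dplyr 1.0 or later.

```r
library(dplyr)

set.seed(42)
# Assumed stand-in for the asker's clustering: k-means on the numeric iris columns
km <- kmeans(iris[, 1:4], centers = 3)

cluster_summary <- iris[, 1:4] %>%
  mutate(cluster = km$cluster) %>%
  group_by(cluster) %>%
  # Mean and max of every variable per cluster, then the cluster size
  summarise(across(everything(), list(mean = mean, max = max)),
            n = n())
```

The result has one row per cluster, with columns named like Sepal.Length_mean and Sepal.Length_max.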
I want to demean all my columns using dplyr. I tried but failed using the "do()" command.
I basically want to replicate the following using easier dplyr commands:
tickers <- c(rep(1,10),rep(2,10))
df <- data.frame(cbind(tickers,rep(1:20),rep(2:21)))
colnames(df) <- c("tickers","col1","col2")
df %>% group_by(tickers)
apply(df[,2:3],2,function(x) x - mean(x))
I am sure this can be done much better using dplyr.
Thanks!
If we are using dplyr, we can do this with mutate_each and use any of the methods mentioned in ?select to match the columns. Here, I am using matches, which takes a regular expression as its pattern.
library(dplyr)
df %>%
mutate_each(funs(.-mean(.)), matches('^col')) %>%
select(-tickers)
But this can be done also using base R:
df[2:3]-colMeans(df[2:3])[col(df[2:3])]
The colMeans output is a vector which can be replicated so that the lengths will be the same.
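In current dplyr releases (1.0 and later), mutate_each and funs are deprecated, so a rough equivalent can be written with mutate() and across(). The sketch below also shows a per-ticker variant, on the assumption that demeaning within each group may be what was wanted; the names demeaned and demeaned_by_ticker are illustrative.

```r
library(dplyr)

tickers <- c(rep(1, 10), rep(2, 10))
df <- data.frame(tickers = tickers, col1 = 1:20, col2 = 2:21)

# Demean each "col" column over the whole data frame
demeaned <- df %>%
  mutate(across(starts_with("col"), ~ .x - mean(.x)))

# Demean within each ticker instead (assumed variant, not in the original answer)
demeaned_by_ticker <- df %>%
  group_by(tickers) %>%
  mutate(across(starts_with("col"), ~ .x - mean(.x))) %>%
  ungroup()
```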
I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr call.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables split into different tables to own functions in dplyr.
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
Calling it as in the question,
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data. To have dplyr print a separate table for each group, use do:
df %>%
group_by(b) %>%
do(printFunction(.))
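In more recent dplyr versions (0.8.1 and later), group_modify() is another way to hand each group to your own function as a separate data frame. The sketch below applies it to the Experience counter from the question; the small data frame and the helper name add_experience are made up for illustration.

```r
library(dplyr)

# Made-up data using the column names from the question
data <- data.frame(ID_variable = factor(c("a", "a", "b", "b", "b")),
                   order_variable = c(2, 1, 3, 1, 2))

# Hypothetical helper: receives ONE group's rows (without the grouping column)
add_experience <- function(x) {
  x <- x[order(x$order_variable), ]
  x$Experience <- seq_len(nrow(x)) - 1
  x
}

result <- data %>%
  group_by(ID_variable) %>%
  group_modify(~ add_experience(.x)) %>%
  ungroup()
```

Here the counter restarts at 0 within each ID, because add_experience only ever sees the rows of a single group.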
I have a very large dataframe (265,874 x 30), with three sensible groups: an age category (1-6), dates (5479 such) and geographic locality (4 total). Each record consists of a choice from each of these, plus 27 count variables. I want to group by each of the grouping variables, then take a colSums on the resulting sub-grouped 27 variables. I've been trying to use dplyr (v0.2) to do it, because doing it manually ends up setting up a lot of redundant things (or resorting to a loop for iterating across the grouping options, for lack of an elegant solution).
Example code:
countData <- sample(0:10, 2000, replace = TRUE)
dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
locality <- sample(1:2, 2000, replace = TRUE)
ageCat <- sample(1:2, 2000, replace = TRUE)
sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))
then what I'd like to do is ...
library("dplyr")
sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)]))
but this doesn't quite work, as the results from colSums() aren't data frames. If I cast it, it works:
sampleDF %.% group_by(locality, ageCat, dates) %.% do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
but the final do(...) bit seems very clunky.
Any thoughts on how to do this more elegantly or effectively? I guess the question comes down to: how best to use the do() function and the . operator to summarize a data frame via colSums.
Note: the do(.) operator only applies to dplyr 0.2, so you need to grab it from GitHub (link), not from CRAN.
Edit: results from suggestions
Three solutions:
My suggestion in post: elapsed, 146.765 seconds.
@joran's suggestion below: 6.902 seconds.
@eddi's suggestion in the comments, using data.table: 6.715 seconds.
I didn't bother to replicate, just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform approximately the same on my data set, and both are significantly faster when used properly than the hack solution I came up with yesterday.
Unless I'm missing something, this seems like a job for summarise_each (a sort of analogue of colwise from plyr):
sampleDF %.% group_by(locality, ageCat, dates) %.% summarise_each(funs(sum))
The grouping columns are not included in the summarizing function by default, and you can select only a subset of columns to apply the functions to, using the same technique as with select.
(summarise_each is in version 0.2 of dplyr but not in 0.1.3, as far as I know.)
The method summarise_each mentioned in joran's answer from 2014 has been deprecated.
Instead, please use summarize_all() or summarize_at().
The methods summarize_all and summarize_at mentioned in Hack-R's answer from 2018 have been superseded.
Instead, please use summarize()/summarise() combined with across().
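A minimal sketch of that summarise() plus across() form, assuming dplyr 1.0 or later and a smaller data set shaped like the one in the question (the 200-row size and the seed are arbitrary choices):

```r
library(dplyr)

set.seed(1)
n <- 200
sampleDF <- data.frame(
  dates = sample(seq(as.Date("2010-01-01"), as.Date("2010-01-30"), by = "day"),
                 n, replace = TRUE),
  locality = sample(1:2, n, replace = TRUE),
  ageCat = sample(1:2, n, replace = TRUE),
  matrix(sample(0:10, n * 10, replace = TRUE), nrow = n, ncol = 10)
)

# Sum every non-grouping column within each locality/ageCat/date combination
sums <- sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise(across(everything(), sum), .groups = "drop")
```

Inside a grouped summarise(), everything() leaves the grouping columns out, so only the ten count columns are summed.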
The paper published for the reshape package (Wickham 2007) gives this example:
library(reshape2)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
dcast(ffm, variable ~ ., c(min, max))
Similarly, this doesn't work in reshape2 but appears to work in Wickham 2007:
dcast(ffm, variable ~ ., summary)
However, the cast function is giving an error. How can I get the function to work?
The paper is for the reshape package, not the reshape2 package. You have also not reproduced the example as it was written. It should be:
library("reshape") # not explicit in the paper, but implied since it is for the reshape package
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
cast(ffm, treatment ~ rep, c(min, max))
Note that the function call is cast, not dcast. That change was one of the major changes between the two packages. Another was the dropping of multiple aggregation at the same time as reshaping as this was considered to be better handled by the plyr package. If you use the reshape package (which is still available from CRAN), the examples work.
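If you would rather stay with reshape2, the multiple aggregation can be handed to plyr, as the note above suggests. This is only a sketch of one way to do it; it summarises per variable, as in the asker's variable ~ . formula, and the result name ranges is illustrative.

```r
library(reshape2)  # provides melt() and the french_fries data
library(plyr)

ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)

# One row per measured variable, with its minimum and maximum value
ranges <- ddply(ffm, "variable", summarise,
                min = min(value),
                max = max(value))
```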