Splitting data and fitting distributions efficiently in R

For a project I have received a large amount of confidential patient-level data that I need to fit distributions to, so that I can use them in a simulation model. I am using R.
The problem is that I need to fit distributions to obtain the shape/rate parameters for at least 288 separate distributions (at least 48 subsets of 6 variables). The process will vary slightly between variables (depending on how each variable is distributed), but I want to be able to set up a function or loop for each variable and generate the shape and rate parameters for each subset I define.
An example of this: I need to find length-of-stay data for subsets of patients. There are 48 subsets of patients. The way I have currently been doing this is by manually filtering the data, extracting the values to a vector, and then fitting the distribution to the vector using fitdist.
i.e. For a variable that is gamma distributed:
vector1 <- los_data %>%
  filter(group == 1, setting == 1, diagnosis == 1)
fitdist(vector1, "gamma")
I am quite new to data science and data processing, and I know there must be a simpler way to do this than by hand! I'm assuming something to do with a matrix, but I am absolutely clueless about how best to proceed.

One common practice is to split the data using split and then apply the function of interest to each group. Let's assume here we have four columns: group, setting, diagnosis and stay.length. The first three each have two levels.
df <- data.frame(
  group = sample(1:2, 64, TRUE),
  setting = sample(1:2, 64, TRUE),
  diagnosis = sample(1:2, 64, TRUE),
  stay.length = sample(1:5, 64, TRUE)
)
> head(df)
  group setting diagnosis stay.length
1     1       1         1           4
2     1       1         2           5
3     1       1         2           4
4     2       1         2           3
5     1       2         2           3
6     1       1         2           5
Perform the split and you will get a list of groups:
dfl <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))
> head(dfl)
$`1.1.1`
[1] 5 3 4 1 4 5 4 2 1
$`2.1.1`
[1] 5 4 5 4 3 1 5 3 1
$`1.2.1`
[1] 4 2 5 4 5 3 5 3
$`2.2.1`
[1] 2 1 4 3 5 4 4
$`1.1.2`
[1] 5 4 4 4 3 2 4 4 5 1 5 5
$`2.1.2`
[1] 5 4 4 5 3 2 4 5 1 2
Afterwards, we can use lapply to perform whatever function on each group in the list. For example we can apply mean
dflm <- lapply(dfl, mean)
> dflm
$`1.1.1`
[1] 3.222222
.
.
.
.
$`2.2.2`
[1] 2.8
In your case, you can apply fitdist or any other function.
dfl.fitdist <- lapply(dfl, function(x) fitdist(x, "gamma"))
> dfl.fitdist
$`1.1.1`
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
estimate Std. Error
shape 3.38170 2.2831073
rate 1.04056 0.7573495
.
.
.
$`2.2.2`
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
estimate Std. Error
shape 4.868843 2.5184018
rate 1.549188 0.8441106
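If what you ultimately need is just a table of shape and rate parameters, one way (a minimal sketch, assuming the fits are stored in dfl.fitdist as above) is to bind the coefficient vectors together:
# Collect the shape/rate estimates from every fit into one table;
# the list names carry the group.setting.diagnosis labels
params <- do.call(rbind, lapply(dfl.fitdist, coef))
params <- data.frame(subset = names(dfl.fitdist), params, row.names = NULL)
head(params)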

OK, your example isn't quite reproducible here, but I think the answer you want will be something like the following:
result <- los_data %>%
  group_by(group, setting, diagnosis) %>%
  do({
    fit <- fitdist(.$my_column, "gamma")
    data_frame(group = .$group[1], setting = .$setting[1], diagnosis = .$diagnosis[1], fit = list(fit))
  }) %>%
  ungroup()
This will give you a data frame of all the fits, with columns for group, setting, diagnosis as well as a list-column which contains the fits for each one. Since it is a list column, you will need to use double brackets to extract individual fits. Example:
# Get the fit in the first row
result$fit[[1]]
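You can also flatten the parameters out of the list-column into ordinary columns; a small sketch, assuming the gamma fits produced above:
# shape and rate as plain numeric columns next to group/setting/diagnosis
result$shape <- sapply(result$fit, function(f) coef(f)[["shape"]])
result$rate  <- sapply(result$fit, function(f) coef(f)[["rate"]])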

Related

How do you randomly assign data into equal sized control and treatment groups in R?

set.seed(31)
resample(1:534, 90, replace = FALSE)
df.orig <- read.csv("project1data.csv")
df.groups <- filter(df.orig, participate == "y")
str(df.groups)
I have randomly selected 90 house numbers from 534 and entered whether or not they were willing to participate in the study into an Excel sheet, and then I filtered out the people who did not want to participate in the study. How do I now randomly assign the participants into two equally sized groups (control and treatment)?
You haven't provided data or code that runs, so I'll generate some data and code to show the idea.
set.seed(31)
# Create dataset with three variables
# Participate are the ones that we wish to include in the study.
# You have those in your excel file.
fakedata <- data.frame(houseid = 1:534,
                       size = rbinom(534, size = 5, prob = .5),
                       participate = sample(c("y", "n"), size = 534, replace = TRUE))
which produces
head(fakedata)
houseid size participate
1 1 3 y
2 2 4 n
3 3 2 n
4 4 2 y
5 5 4 y
6 6 2 n
Now we can use tidyverse to generate a random permutation of cases/controls. First we create a vector of the correct length (using rep with length) and then we shuffle them using sample.
library("tidyverse")
fakedata %>%                 # Take data
  filter(participate == "y") %>%
  mutate(group = sample(rep(c("Case", "Ctrl"), length = n())))
This gives
houseid size participate group
1 1 3 y Case
2 4 2 y Case
3 5 4 y Ctrl
4 7 4 y Case
5 8 1 y Case
6 9 4 y Ctrl
7 13 3 y Case
8 16 1 y Ctrl
.
.
.
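As a quick check (a usage sketch on the same fake data), you can confirm that the two groups come out equal in size, or differ by at most one when the number of participants is odd:
fakedata %>%
  filter(participate == "y") %>%
  mutate(group = sample(rep(c("Case", "Ctrl"), length = n()))) %>%
  count(group)   # group sizes should be (near-)equal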

Data Summary in R: Using count() and finding an average numeric value [duplicate]

I am working on a directed graph and need some advice on generating a particular edge attribute.
I need to use both the count of interactions as well as another quality of the interaction (the average length of text used within interactions between the same unique from/to pair) in my visualization.
I am struggling to figure out how to create this output in a clean, scalable way. Below is my current input, solution, and output. I have also included an ideal output along with some things I have tried.
Input
network = read.table(text = "
Actor Receiver Length
1 1 4
1 2 20
1 3 9
1 3 100
1 3 15
2 3 38
3 1 25
3 1 17",
sep = "", header = TRUE)
I am currently using dplyr to get a count of how many times each pair appears to achieve the output below.
I use the following command:
EDGE <- dplyr::count(network, Actor, Receiver )
names(EDGE) <- c("from","to","count")
To achieve my current output:
From To Count
1 1 1
1 2 1
1 3 3
2 3 1
3 1 2
Ideally, however, I'd like to know the average lengths for each pair as well, and end up with something like this:
From To Count AverageLength
1 1 1 4
1 2 1 20
1 3 3 41
2 3 1 38
3 1 2 21
Is there any way I can do this without creating a host of new data frames and then grafting them back onto the output? I am mostly having issues trying to summarize and count at the same time. My naive solution has been to simply add "Length" as an argument to the count function, but this does not produce anything useful. I can also see that it may be useful to combine actor-receiver pairs and then use a summary function to create something to graft onto the frame alongside the count. In the interest of scaling, however, I would like to figure out if there is a simple and clear way of doing this.
Thank you very much for any assistance with this issue.
A naive solution would be to use cbind() to connect these two outputs together. Here is example code:
Actor <- c(rep(1, 5), 2, 3, 3)
Receiver <- c(1, 2, rep(3, 4), 1, 1)
Length <- c(4, 20, 9, 100, 15, 38, 25, 17)
x <- data.frame("Actor" = Actor,
                "Receiver" = Receiver,
                "Length" = Length)
library(plyr)
EDGE <- cbind(ddply(x, .(Actor, Receiver), nrow),                           # this part replaces dplyr::count
              ddply(x, .(Actor, Receiver), summarize, mean(Length))[ , 3])  # this computes the mean Length per pair
names(EDGE) <- c("From", "To", "Count", "AverageLength")
EDGE # Gives the expected results
From To Count AverageLength
1 1 1 1 4.00000
2 1 2 1 20.00000
3 1 3 3 41.33333
4 2 3 1 38.00000
5 3 1 2 21.00000
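Since the question already uses dplyr, the count and the average can also be produced in a single summarise() call; a sketch, assuming the input table read in above is called network:
library(dplyr)
EDGE <- network %>%
  group_by(Actor, Receiver) %>%
  summarise(Count = n(), AverageLength = mean(Length)) %>%
  ungroup()
names(EDGE) <- c("From", "To", "Count", "AverageLength")
EDGE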

loop ordinal regression statistical analysis and save the data R

Could you please help me with a loop? I am relatively new to R.
A short version of the data looks like this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have code for an ordinal regression analysis for one trial from one participant (eye-tracking data, eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered logistic regression model - fit ordered logit model----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------odd ratios (OR)--------------
exp(coef(m))
The eyeData file is a huge mass of data consisting of 91832 observations of 11 variables. In total there are 41 participants with 78 trials each. In my code I extract data from one trial from one participant to run the analysis. However, it takes a long time to run the analysis manually for all trials for all participants. Could you please help me create a loop that will read in all 78 trials from all 41 participants and save the statistical output (I want to save summary(m), ci, and coef(m)) in one file.
Thank you in advance!
You could generate a unique identifier for every trial of every participant. Then you could loop over all unique values of this identifier and subset the data accordingly. Then you run the regressions and save the output as an R object.
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
  ss <- eyeData[eyeData$uniqueIdent == un, ]
  ss <- ss[!duplicated(ss$wordTar), ]  # maybe do this outside the loop
  ss$lFreq <- log(ss$Freq)             # you could do this outside the loop too
  # create DV
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF ~ lFreq * Len, data = ss, link = 'probit')
  seeSumm <- summary(m)
  ci <- confint(m)
  oddsR <- exp(coef(m))
  save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
  # -un- is added to the output file name so you can identify where it came from
}
A variation of this is to collect the output of every iteration in a list: create an empty list "gatherRes" before the loop and, after running the estimation and post-estimation commands, fill the corresponding element on each pass:
gatherRes <- vector(mode = "list", length = length(uniqueID))  ## before the loop
names(gatherRes) <- uniqueID                                   ## also before the loop, so indexing by name works
gatherRes[[un]] <- list(seeSumm, ci, oddsR)                    ## last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
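For example, a rough sketch along those lines, assuming the same eyeData, uniqueIdent and clm setup as above (fit_one is a hypothetical helper name):
# Wrap one iteration of the loop in a function, then lapply over the identifiers
fit_one <- function(un) {
  ss <- eyeData[eyeData$uniqueIdent == un, ]
  ss <- ss[!duplicated(ss$wordTar), ]
  ss$lFreq <- log(ss$Freq)
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF ~ lFreq * Len, data = ss, link = 'probit')
  list(summary = summary(m), ci = confint(m), oddsR = exp(coef(m)))
}
gatherRes <- lapply(uniqueID, fit_one)
names(gatherRes) <- uniqueID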
Here is a solution using the plyr package (it should be faster than a for loop).
Since you don't provide a reproducible example, I'll use the iris data as an example.
First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x){
  m = lm(Sepal.Width ~ Sepal.Length, data = x)
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, with your variables of interest as the grouping variables:
data(iris)
library(plyr) #if not installed do install.packages("plyr")
#Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing output of summary, confint and coef for each species.
For your specific case, the function could look like (not tested):
ordFit_stats = function(x){
  # Remove duplicates
  x = x[!duplicated(x$wordTar), ]
  # Make log frequencies
  x$lFreq <- log(x$Freq)
  # Make ranks
  x$rankF <- as.factor(seq(nrow(x)))
  # Fit model
  m <- clm(rankF ~ lFreq * Len, data = x, link = 'probit')
  # Return list of statistics
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)
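Afterwards you can pull individual fits out of the returned list; a small usage sketch (the exact labels depend on the grouping values plyr generates):
names(results)         # the participant/trial labels plyr assigned
results[[1]]$summary   # full summary for the first participant/trial combination
results[[1]]$coef      # its coefficients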

How to find the final value from repeated measures in R?

I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is the individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use head(mass, 1) and tail(mass, 1) in place of mass[1] and mass[length(mass)]:
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
Once you have this table, it's pretty simple:
t <- tapply(test$mass, test$indv, max)
This will give you an array with indv as the names and mass_f as the values (assuming mass only increases over time, so that the maximum mass is also the final mass).
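For completeness, here is an equivalent without plyr, a sketch using dplyr and the same which.min/which.max idea (assuming the data frame is called test as above):
library(dplyr)
test %>%
  group_by(indv) %>%
  mutate(mass_i = mass[which.min(time)],   # mass at the earliest time
         mass_f = mass[which.max(time)]) %>%
  ungroup()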

Regress each column in a data frame on a vector in R

I want to regress each column in a data set on a vector, then return the column which has the highest R-squared value. e.g. I have a vector HAPPY <- c(3,2,2,3,1,3,1,3) and I have a data set.
HEALTH CONINC MARITAL SATJOB1 MARITAL2 HAPPY
3 441 5 1 2 3
1 1764 5 1 2 2
2 3087 5 1 2 2
3 3087 5 1 2 3
1 3969 2 1 5 1
1 3969 5 1 2 3
2 4852 5 1 2 2
3 5734 3 1 3 3
Regress "Happy" on each of the columns in the data set on the left, then return the column which has the highest R-squared. Example: lm(Health ~ Happy) if Health had the highest R-squared value, then return Health.
I've tried apply, but can't seem to figure out how to return the regression with the highest R-squared. Any suggestions?
I would break this up into two steps:
1) Determine R-squares for each model
2) Determine which is the highest value
mydf <- data.frame(aa = rpois(8, 4), bb = rpois(8, 2), cc = rbinom(8, 1, .5),
                   happy = c(3, 2, 2, 3, 1, 3, 1, 3))
myRes <- sapply(mydf[-ncol(mydf)], function(x){
  mylm <- lm(x ~ mydf$happy)
  theR2 <- summary(mylm)$r.squared
  return(theR2)
})
names(myRes[which(myRes==max(myRes))])
This was assuming that happy is in your data.frame.
This will do what you want, assuming your data.frame is called 'd'
r2s <- apply(d, 2, function(x) summary(lm(x ~ HAPPY))$r.squared)
names(d)[which.max(r2s)]
You can find out how to extract components of the model, or in this case, a summary of the model, with the str() command. It will give you a read out that helps you access the components of any complex object.
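For instance, a quick sketch reusing mydf from the earlier answer:
str(summary(lm(aa ~ happy, data = mydf)))   # lists every component of the summary, including $r.squared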
Here's a solution using the colwise() function from the plyr package.
library(plyr)
df = data.frame(a = runif(10), b=runif(10), c=runif(10), d = runif(10))
Rsq = function(x) summary(lm(df$a ~ x))$r.squared
Rsqall = colwise(Rsq)(df[, 2:4])
Rsqall
names(Rsqall)[which.max(Rsqall)]
