Appending results of dlply function to original table - r

This question builds on the answer that Simon and James provided here
The dlply function worked well to give me Y estimates within my data subsets. Now, my challenge is getting these Y estimates and residuals back into the original data frame to calculate goodness of fit statistics and for further analysis.
I was able to use cbind to convert the dlply output lists to row vectors, but this doesn't quite work as the result is (sorry about the poor markdown).
model <- function(df){ glm(Y~D+O+A+log(M), family=poisson(link="log"), data=df)}
Modrpt <- ddply(msadata, "Dmsa", function(x)coef(model(x)))
Modest <- cbind(dlply(msadata, "Dmsa", function(x) fitted.values(model(x))))
Subset name | Y_Estimates
-------------------------
Dmsa 1 | c(4353.234, 234.34,...
Dmsa 2 | c(998.234, 2543.55,...
This doesn't really answer the mail, because I need to get the individual Y estimates (separated by commas in the Y_estimates column of the Modest data frame) into my msadata data frame.
Ideally, and I know this is incorrect, but I'll put it here for an example, I'd like to do something like this:
msadata$Y_est <- cbind(dlply(msadata, "Dmsa", function(x)fitted.values(model(x))))
If I can decompose the list into individual Y estimates, I could join this to my msadata data frame by "Dmsa". I feel like this is very similar to Michael's answer here, but something is needed to separate the list elements prior to employing Michael's suggestion of join() or merge(). Any ideas?

In the previous question , I proposed a data.table solution. I think it is more appropriate to what you want to do, since you want to apply models by group then aggregate the results with the original data.
library(data.table)
DT <- as.data.table(df)
models <- DT[,{
mod= glm(Y~D+O+A+log(M), family=poisson(link="log"))
data.frame(res= mod$residuals,
fit=mod$fitted.values,
mod$model)
},
by = Dmsa]
Here an application with some data:
## create some data
set.seed(1)
d.AD <- data.frame(
counts = sample(c(10:30),18,rep=TRUE),
outcome = gl(3,1,18),
treatment = gl(3,6),
type = sample(c(1,2),18,rep=TRUE) ) ## type is the grouping variable
## corece data to a data.table
library(data.table)
DT <- as.data.table(d.AD)
## apply models
DT[,{mod= glm(formula = counts ~ outcome + treatment,
family = poisson())
data.frame(res= mod$residuals,
fit=mod$fitted.values,
mod$model)},
by = type]
type res fit counts outcome treatment
1: 1 -3.550408e-01 23.25729 15 1 1
2: 1 2.469211e-01 23.25729 29 1 1
3: 1 9.866698e-02 25.48543 28 3 1
4: 1 5.994295e-01 18.13147 29 1 2
5: 1 4.633974e-16 23.00000 23 2 2
6: 1 1.576093e-01 19.86853 23 3 2
7: 1 -3.933199e-01 18.13147 11 1 2
8: 1 -3.456991e-01 19.86853 13 3 2
9: 1 6.141856e-02 22.61125 24 1 3
10: 1 4.933908e-02 24.77750 26 3 3
11: 1 -1.154845e-01 22.61125 20 1 3
12: 2 9.229985e-02 15.56349 17 1 1
13: 2 5.805515e-03 21.87302 22 2 1
14: 2 -1.004589e-01 15.56349 14 1 1
15: 2 2.537653e-16 14.00000 14 1 2
16: 2 -1.603110e-01 21.43651 18 1 3
17: 2 1.662347e-01 21.43651 25 1 3
18: 2 -4.214963e-03 30.12698 30 2 3

Related

Is there a way in R to specify which column in my data is groups and which is blocks in order to do a Friedman's test? I am comparing results to SPSS

I will give a sample below of how my data is organized but every time I run Friedman's using frieman.test(y =, groups = , blocks= ) it gives me an error that my data is not from an unreplicated complete block design despite the fact that it is.
score
treatment
day
10
1
1
20
1
1
40
1
1
7
2
1
100
2
1
58
2
1
98
3
1
89
3
1
40
3
1
70
4
1
10
4
1
28
4
1
86
5
1
200
5
1
40
5
1
77
1
2
100
1
2
90
1
2
33
2
2
15
2
2
25
2
2
23
3
2
54
3
2
67
3
2
1
4
2
2
4
2
400
4
2
16
5
2
10
5
2
90
5
2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
the code above gives me the error I described earlier.
However when I convert the csv data to a matrix, Friedman's works but the answer seems wrong as SPSS gives a different result for the degrees of freedom.
sample_data$treatment <- as.factor(sample_data$treatment) #converting to categorical independent variable
sample_data$day <- as.factor(sample_data$day) #converting to categorical independent variable
data = as.matrix(sample_data)
friedman.test(data)
friedman2 <- friedman.test(y = data$score, groups = data$treatment, blocks = data$day)
summary(friedman2)
Any idea what I am doing incorrectly?
I am aware that Friedman's gives me chi-square but I am also wondering how can I get the test statistic instead of the chi-square value.
I am using Rstudio and I am new to R. And I want to know how to specify groups as treatment and day as blocks.
We could summarise the data by taking the mean of the 'score' and then use that summarised data in friedman.test
sample_data1 <- aggregate(score ~ ., sample_data, FUN = mean)
friedman.test(sample_data1$score, groups = sample_data1$treatment,
blocks = sample_data1$day)

loop ordinal regression statistical analysis and save the data R

could you, please, help me with a loop? I am relatively new to R.
The short version of the data looks ike this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have a code for ordinal regression analysis for one participant for one trial (eye-tracking data - eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered logistic regression model - fit ordered logit model----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------odd ratios (OR)--------------
exp(coef(m))
The eyeData file is a huge massive of data consisting of 91832 observations with 11 variables. In total there are 41 participants with 78 trials each. In my code I extract data from one trial from each participant to run the anaysis. However, it takes a long time to run the analysis manually for all trials for all participants. Could you, please, help me to create a loop that will read in all 78 trials from all 41 participants and save the output of statistics (I want to save summary(m), ci, and coef(m)) in one file.
Thank you in advance!
You could generate a unique identifier for every trial of every particpant. Then you could loop over all unique values of this identifier and subset the data accordingly. Then you run the regressions and save the output as a R object
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
ss <- eyeData[eyeData$uniqueID == un,]
ss <- ss[!duplicated(ss$wordTar), ] #maybe do this outside the loop
ss$lFreq <- log(ss$Freq) #you could do this outside the loop too
#create DV
ss$rankF <- as.factor(seq(nrow(ss)))
m <- clm(rankF~lFreq*Len, data=ss, link='probit')
seeSumm <- summary(m)
ci <- confint(m)
oddsR <- exp(coef(m))
save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
# add -un- to the output file to be able identify where it came from
}
Variations of this could include combining the output of every iteration in a list (create an empty list in the beginning) and then after running the estimations and the postestimation commands combine the elements in a list and recursively fill the previously created list "gatherRes":
gatherRes <- vector(mode = "list", length = length(unique(eyeData$uniqueIdent) ##before the loop
gatherRes[[un]] <- list(seeSum, ci, oddsR) ##last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
Here is a solution using the plyr package (it should be faster than a for loop).
Since you don't provide a reproducible example, I'll use the iris data as an example.
First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x){
m = lm(Sepal.Width ~ Sepal.Length, data = x)
return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, using your variables of interest as grouping
data(iris)
library(plyr) #if not installed do install.packages("plyr")
#Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing output of summary, confint and coef for each species.
For your specific case, the function could look like (not tested):
ordFit_stats = function(x){
#Remove duplicates
x = x[!duplicated(x$wordTar), ]
# Make log frequencies
x$lFreq <- log(x$Freq)
# Make ranks
x$rankF <- as.factor(seq(nrow(x)))
# Fit model
m <- clm(rankF~lFreq*Len, data=x, link='probit')
# Return list of statistics
return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)

get z standardized score within each group

Here is the data.
set.seed(23) data<-data.frame(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
ID group value
1 1 1 0.4133934
2 2 2 0.6444651
3 3 3 0.1350871
4 4 1 0.5924411
5 5 2 0.3439465
6 6 3 0.3673059
7 7 1 0.3202062
8 8 2 0.8883733
9 9 3 0.7506174
10 10 1 0.3301955
11 11 2 0.7365258
12 12 3 0.1502212
I want to get z-standardized scores within each group. so I try
library(weights)
data_split<-split(data, data$group) #split the dataframe
stan<-lapply(data_split, function(x) stdz(x$value)) #compute z-scores within group
However, It looks wrong because I want to add a new variable following 'value'
How can I do that? Kindly provide some suggestions(sample code). Any help is greatly appreciated .
Use this instead:
within(data, stan <- ave(value, group, FUN=stdz))
No need to call split nor lapply.
One way using data.table package:
library(data.table)
library(weights)
set.seed(23)
data <- data.table(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
setkey(data, ID)
dataNew <- data[, list(ID, stan = stdz(value)), by = 'group']
the result is:
group ID stan
1: 1 1 -0.6159312
2: 1 4 0.9538398
3: 1 7 -1.0782747
4: 1 10 0.7403661
5: 2 2 -1.2683237
6: 2 5 0.7839781
7: 2 8 0.8163844
8: 2 11 -0.3320388
9: 3 3 0.6698418
10: 3 6 0.8674548
11: 3 9 -0.2131335
12: 3 12 -1.3241632
I tried Ferdinand.Kraft's solution but it didn't work for me. I think the stdz function isn't included in the basic R install. Moreover, the within part troubled me in a large dataset with many variables. I think the easiest way is:
data$value.s <- ave(data$value, data$group, FUN=scale)
Add the new column while in your function, and have the function return the whole data frame.
stanL<-lapply(data_split, function(x) {
x$stan <- stdz(x$value)
x
})
stan <- do.call(rbind, stanL)

Regress each column in a data frame on a vector in R

I want to regress each column in a data set on a vector then return the column which has the highest R-squared value. e.g. I have a vector HAPPY <- (3,2,2,3,1,3,1,3) and I have a data set.
HEALTH CONINC MARITAL SATJOB1 MARITAL2 HAPPY
3 441 5 1 2 3
1 1764 5 1 2 2
2 3087 5 1 2 2
3 3087 5 1 2 3
1 3969 2 1 5 1
1 3969 5 1 2 3
2 4852 5 1 2 2
3 5734 3 1 3 3
Regress "Happy" on each of the columns in the data set on the left, then return the column which has the highest R-squared. Example: lm(Health ~ Happy) if Health had the highest R-squared value, then return Health.
I've tried apply, but can't seem to figure out how to return the regression with the highest R-squared. Any suggestions?
I would break this up into two steps:
1) Determine R-squares for each model
2) Determine which is the highest value
mydf<-data.frame(aa=rpois(8,4),bb=rpois(8,2),cc=rbinom(8,1,.5),
happy=c(3,2,2,3,1,3,1,3))
myRes<-sapply(mydf[-ncol(mydf)],function(x){
mylm<-lm(x~mydf$happy)
theR2<-summary(mylm)$r.squared
return(theR2)
})
names(myRes[which(myRes==max(myRes))])
This was assuming that happy is in your data.frame.
This will do what you want, assuming your data.frame is called 'd'
r2s <- apply(d, 2, function(x) summary(lm(x ~ HAPPY))$r.squared)
names(d)[which.max(r2s)]
You can find out how to extract components of the model, or in this case, a summary of the model, with the str() command. It will give you a read out that helps you access the components of any complex object.
Here's a solution using the colwise() function from the plyr package.
library(plyr)
df = data.frame(a = runif(10), b=runif(10), c=runif(10), d = runif(10))
Rsq = function(x) summary(lm(df$a ~ x))$r.squared
Rsqall = colwise(Rsq)(df[, 2:4])
Rsqall
names(Rsqall)[which.max(Rsqall)]

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and xis the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656

Resources