Is there a way in R to specify which column in my data is groups and which is blocks in order to do a Friedman test? I am comparing results to SPSS

I will give a sample below of how my data is organized, but every time I run Friedman's test using friedman.test(y =, groups =, blocks =) it gives me an error that my data is not from an unreplicated complete block design, despite the fact that it is.
score  treatment  day
   10          1    1
   20          1    1
   40          1    1
    7          2    1
  100          2    1
   58          2    1
   98          3    1
   89          3    1
   40          3    1
   70          4    1
   10          4    1
   28          4    1
   86          5    1
  200          5    1
   40          5    1
   77          1    2
  100          1    2
   90          1    2
   33          2    2
   15          2    2
   25          2    2
   23          3    2
   54          3    2
   67          3    2
    1          4    2
    2          4    2
  400          4    2
   16          5    2
   10          5    2
   90          5    2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
The code above gives me the error I described earlier.
However, when I convert the CSV data to a matrix, Friedman's test runs, but the answer seems wrong, as SPSS gives a different result for the degrees of freedom.
sample_data$treatment <- as.factor(sample_data$treatment) #converting to categorical independent variable
sample_data$day <- as.factor(sample_data$day) #converting to categorical independent variable
data = as.matrix(sample_data)
friedman.test(data)
friedman2 <- friedman.test(y = data$score, groups = data$treatment, blocks = data$day)
summary(friedman2)
Any idea what I am doing incorrectly?
I am aware that Friedman's test reports a chi-square value, but I am also wondering how I can get the test statistic instead of the chi-square value.
I am using RStudio and I am new to R, and I want to know how to specify treatment as the groups and day as the blocks.

We could summarise the data by taking the mean of the 'score' for each treatment/day combination and then use that summarised data in friedman.test. (The error arises because the original data has three replicate scores per treatment/day cell, so it is not an unreplicated complete block design; averaging the replicates leaves exactly one observation per cell.)
sample_data1 <- aggregate(score ~ ., sample_data, FUN = mean)
friedman.test(sample_data1$score, groups = sample_data1$treatment,
              blocks = sample_data1$day)
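As a self-contained check, the whole workflow can be sketched by rebuilding the sample data from the table above (the data.frame construction is my transcription of it). friedman.test returns an "htest" object, so the chi-square statistic, degrees of freedom and p-value can be pulled out directly rather than via summary():

```r
# Rebuild the question's sample data (score / treatment / day)
sample_data <- data.frame(
  score = c(10, 20, 40, 7, 100, 58, 98, 89, 40, 70, 10, 28, 86, 200, 40,
            77, 100, 90, 33, 15, 25, 23, 54, 67, 1, 2, 400, 16, 10, 90),
  treatment = factor(rep(rep(1:5, each = 3), times = 2)),
  day = factor(rep(1:2, each = 15))
)

# Average the three replicate scores per treatment/day cell so each
# group/block combination has exactly one observation
sample_data1 <- aggregate(score ~ treatment + day, sample_data, FUN = mean)

res <- friedman.test(sample_data1$score,
                     groups = sample_data1$treatment,
                     blocks = sample_data1$day)

res$statistic  # the Friedman chi-square statistic
res$parameter  # degrees of freedom (k - 1 = 4 for 5 treatments)
res$p.value
```

Note that the statistic friedman.test reports is the chi-square value; there is no separate test statistic in the htest object.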

Related

How to make a normally distributed variable depend on entries and time in R?

I'm trying to generate a cross-sectional time-series dataset to compare different models.
In this dataset, I have an ID variable and a time variable. I'm trying to add a normally distributed variable that depends on both identifiers. In other words, how do I create a variable that recognizes both ID and time in R?
If my question seems unclear, feel free to ask any questions.
Thanks in advance.
df2 <- read.table(
  text = "Year,ID,H,
1,1,N(2.3),
2,1,N(2.3),
3,1,N(2.3),
1,2,N(0.1),
2,2,N(0.1),
3,2,N(0.1),
", sep = ",", header = TRUE)
Assuming that the data in the dataframe df looks like
ID  Time
 1     1
 1     2
 1     3
 1     4
 2     1
 2     2
 2     3
 2     4
 3     1
 3     2
 3     3
 3     4
you can generate a variable y that depends on ID and time as the sum of two random normal distributions (yielding another normal distribution) that depend on ID and time respectively:
set.seed(42)
df <- data.frame(
  ID = rep(1:3, each = 4),
  time = rep(1:4, times = 3)
)
df$y <- rnorm(nrow(df), mean = df$ID, sd = 1 + 0.1 * df$ID) +
  rnorm(nrow(df), mean = df$time, sd = 0.05 * df$time)
# Output:
ID time y
1 1 1 3.438611
2 1 2 2.350953
3 1 3 4.379443
4 1 4 5.823339
5 2 1 3.470909
6 2 2 3.607005
7 2 3 6.447756
8 2 4 6.150432
9 3 1 6.608619
10 3 2 4.740341
11 3 3 7.670543
12 3 4 10.215574
Note that the underlying normal distributions depend on both ID and time. That is in contrast to your example table above where it looks like it solely depends on ID -- namely resulting in a single normal distribution per ID that is independent of the time variable.
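If instead you do want H to depend only on ID, as the N(2.3)/N(0.1) column in the question's table suggests, a minimal sketch could look like this (the `means` lookup vector and the standard deviation of 1 are my assumptions, not stated in the question):

```r
set.seed(1)
df2 <- data.frame(
  Year = rep(1:3, times = 2),
  ID = rep(1:2, each = 3)
)

# Assumed per-ID means: ID 1 draws from N(2.3, 1), ID 2 from N(0.1, 1)
means <- c(2.3, 0.1)
df2$H <- rnorm(nrow(df2), mean = means[df2$ID], sd = 1)
```

Indexing `means` by `df2$ID` gives every row of a given ID the same mean, so each ID has its own normal distribution that is independent of Year.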

loop ordinal regression statistical analysis and save the data R

Could you please help me with a loop? I am relatively new to R.
A short version of the data looks like this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have code for an ordinal regression analysis of one trial from one participant (eye-tracking data, eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered logistic regression model - fit ordered logit model----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------odd ratios (OR)--------------
exp(coef(m))
The eyeData file is a huge mass of data consisting of 91832 observations of 11 variables. In total there are 41 participants with 78 trials each. In my code I extract the data from one trial for one participant to run the analysis. However, it takes a long time to run the analysis manually for all trials for all participants. Could you please help me create a loop that will read in all 78 trials from all 41 participants and save the statistical output (I want to save summary(m), ci, and coef(m)) in one file?
Thank you in advance!
You could generate a unique identifier for every trial of every participant, then loop over all unique values of this identifier and subset the data accordingly. Then you run the regressions and save the output as an R object:
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
  ss <- eyeData[eyeData$uniqueIdent == un, ]
  ss <- ss[!duplicated(ss$wordTar), ]  # maybe do this outside the loop
  ss$lFreq <- log(ss$Freq)  # you could do this outside the loop too
  # create DV
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF ~ lFreq * Len, data = ss, link = 'probit')
  seeSumm <- summary(m)
  ci <- confint(m)
  oddsR <- exp(coef(m))
  # -un- in the output file name identifies which trial it came from
  save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
}
Variations of this could include combining the output of every iteration in a list: create an empty list "gatherRes" before the loop, then after running the estimation and post-estimation commands, recursively fill it:
gatherRes <- vector(mode = "list", length = length(unique(eyeData$uniqueIdent)))  ## before the loop
gatherRes[[un]] <- list(seeSumm, ci, oddsR)  ## last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
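For instance, the split + lapply pattern could look like this, sketched on the built-in iris data with lm standing in for clm so that the example is self-contained:

```r
# Fit one model per group and gather summary, CIs and coefficients per group
fit_one <- function(x) {
  m <- lm(Sepal.Width ~ Sepal.Length, data = x)
  list(summary = summary(m), confint = confint(m), coef = coef(m))
}

# split() makes one data frame per group; lapply() fits each one
results <- lapply(split(iris, iris$Species), fit_one)

names(results)       # one element per species
results$setosa$coef  # coefficients of the setosa model
```

For the eye-tracking data you would split on the unique identifier (`split(eyeData, eyeData$uniqueIdent)`) and put the de-duplication, log-frequency and rank steps inside the function.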
Here is a solution using the plyr package (it should be faster than a for loop).
Since you don't provide a reproducible example, I'll use the iris data as an example.
First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x){
  m = lm(Sepal.Width ~ Sepal.Length, data = x)
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, with your variables of interest as the grouping variables:
data(iris)
library(plyr) #if not installed do install.packages("plyr")
#Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing output of summary, confint and coef for each species.
For your specific case, the function could look like (not tested):
ordFit_stats = function(x){
  # Remove duplicates
  x = x[!duplicated(x$wordTar), ]
  # Make log frequencies
  x$lFreq <- log(x$Freq)
  # Make ranks
  x$rankF <- as.factor(seq(nrow(x)))
  # Fit model
  m <- clm(rankF ~ lFreq * Len, data = x, link = 'probit')
  # Return list of statistics
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)

Counting in R and preserving the order of occurence

Suppose I have generated a vector using the following statement:
x1 <- rep(4:1, sample(1:100,4))
Now, when I try to count the number of occurrences using the following commands:
count(x1)  # count() is from the plyr package
x freq
1 1 40
2 2 57
3 3 3
4 4 46
or
as.data.frame(table(x1))
x1 Freq
1 1 40
2 2 57
3 3 3
4 4 46
In both cases, the order of occurrence is not preserved. I want to preserve the order of occurrence, i.e. the output should be like this
x1 Freq
1 4 46
2 3 3
3 2 57
4 1 40
What is the cleanest way to do this? Also, is there a way to coerce a particular order?
You are looking for the rle function:
rle(x1)
## Run Length Encoding
## lengths: int [1:4] 12 2 23 52
## values : int [1:4] 4 3 2 1
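To get the same two-column layout as count/table while keeping the order of occurrence, the rle output can be wrapped in a data frame. This works here because each value of x1 forms a single contiguous run:

```r
set.seed(42)
x1 <- rep(4:1, sample(1:100, 4))

# rle() gives one lengths/values pair per run, in order of first occurrence
r <- rle(x1)
data.frame(x1 = r$values, Freq = r$lengths)
```

If the same value could recur in separate runs, you would need to sum the lengths per value first; for a vector built with rep() like this one, each value appears in exactly one run.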
You can order the table like this:
set.seed(42)
x1 <- rep(4:1, sample(1:100,4))
table(x1)[order(unique(x1))]
# x1
# 4 3 2 1
# 92 93 29 81
One way is to convert your variable to factor and specify the desired order with the levels argument. From ?table: "table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels"; "It is best to supply factors rather than rely on coercion.". So by converting to factor yourself, you are in charge over the coercion and the order set by levels.
x1 <- rep(factor(4:1, levels = 4:1), sample(1:100,4))
table(x1)
# x1
# 4 3 2 1
# 90 72 11 16

Appending results of dlply function to original table

This question builds on the answer that Simon and James provided here
The dlply function worked well to give me Y estimates within my data subsets. Now, my challenge is getting these Y estimates and residuals back into the original data frame to calculate goodness of fit statistics and for further analysis.
I was able to use cbind to convert the dlply output lists to row vectors, but this doesn't quite work, as the result is (sorry about the poor markdown):
model <- function(df){ glm(Y~D+O+A+log(M), family=poisson(link="log"), data=df)}
Modrpt <- ddply(msadata, "Dmsa", function(x)coef(model(x)))
Modest <- cbind(dlply(msadata, "Dmsa", function(x) fitted.values(model(x))))
Subset name | Y_Estimates
-------------------------
Dmsa 1 | c(4353.234, 234.34,...
Dmsa 2 | c(998.234, 2543.55,...
This doesn't really answer the mail, because I need to get the individual Y estimates (separated by commas in the Y_estimates column of the Modest data frame) into my msadata data frame.
Ideally, and I know this is incorrect, but I'll put it here for an example, I'd like to do something like this:
msadata$Y_est <- cbind(dlply(msadata, "Dmsa", function(x)fitted.values(model(x))))
If I can decompose the list into individual Y estimates, I could join this to my msadata data frame by "Dmsa". I feel like this is very similar to Michael's answer here, but something is needed to separate the list elements prior to employing Michael's suggestion of join() or merge(). Any ideas?
In the previous question, I proposed a data.table solution. I think it is more appropriate to what you want to do, since you want to apply models by group and then combine the results with the original data.
library(data.table)
DT <- as.data.table(df)
models <- DT[, {
  mod = glm(Y ~ D + O + A + log(M), family = poisson(link = "log"))
  data.frame(res = mod$residuals,
             fit = mod$fitted.values,
             mod$model)
}, by = Dmsa]
Here is an application with some data:
## create some data
set.seed(1)
d.AD <- data.frame(
  counts = sample(c(10:30), 18, rep = TRUE),
  outcome = gl(3, 1, 18),
  treatment = gl(3, 6),
  type = sample(c(1, 2), 18, rep = TRUE))  ## type is the grouping variable
## coerce data to a data.table
library(data.table)
DT <- as.data.table(d.AD)
## apply models
DT[, {
  mod = glm(formula = counts ~ outcome + treatment,
            family = poisson())
  data.frame(res = mod$residuals,
             fit = mod$fitted.values,
             mod$model)
}, by = type]
type res fit counts outcome treatment
1: 1 -3.550408e-01 23.25729 15 1 1
2: 1 2.469211e-01 23.25729 29 1 1
3: 1 9.866698e-02 25.48543 28 3 1
4: 1 5.994295e-01 18.13147 29 1 2
5: 1 4.633974e-16 23.00000 23 2 2
6: 1 1.576093e-01 19.86853 23 3 2
7: 1 -3.933199e-01 18.13147 11 1 2
8: 1 -3.456991e-01 19.86853 13 3 2
9: 1 6.141856e-02 22.61125 24 1 3
10: 1 4.933908e-02 24.77750 26 3 3
11: 1 -1.154845e-01 22.61125 20 1 3
12: 2 9.229985e-02 15.56349 17 1 1
13: 2 5.805515e-03 21.87302 22 2 1
14: 2 -1.004589e-01 15.56349 14 1 1
15: 2 2.537653e-16 14.00000 14 1 2
16: 2 -1.603110e-01 21.43651 18 1 3
17: 2 1.662347e-01 21.43651 25 1 3
18: 2 -4.214963e-03 30.12698 30 2 3

Return rows of data frame that meet multiple criteria in R (panel data random sample)

I am hoping to create a random sample from panel data based on the unique id.
For instance if you start with:
e = data.frame(id=c(1,1,1,2,2,3,3,3,4,4,4,4), data=c(23,34,45,1,23,45,6,2,9,39,21,1))
And you want a random sample of 2 unique ids:
out = data.frame(id=c(1,1,1,3,3,3), data=c(23,34,45,45,6,2))
Although sample gives me random unique ids
sample( e$id ,2) # give c(1,3)
I can't figure out how to use logical calls to return all the desired data.
I have tried a number of things including:
e[ e$id == sample( e$id ,2) ] # only returns 1/2 the data
Any ideas? It's killing me.
I'm not entirely sure what your expected result should be, but does this work for what you're trying to do?
> e[e$id %in% sample(e$id, 2), ]
id data
6 3 45
7 3 6
8 3 2
9 4 9
10 4 39
11 4 21
12 4 1
Or maybe you want this:
> e[e$id %in% sample(unique(e$id), 2), ]
id data
1 1 23
2 1 34
3 1 45
9 4 9
10 4 39
11 4 21
12 4 1
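Putting the second approach together as a reproducible sketch (set.seed is my addition so the draw repeats):

```r
set.seed(1)
e <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
                data = c(23, 34, 45, 1, 23, 45, 6, 2, 9, 39, 21, 1))

# Draw 2 ids from the unique ids, then keep every row carrying those ids
ids <- sample(unique(e$id), 2)
out <- e[e$id %in% ids, ]

length(unique(out$id))  # always 2
```

Sampling from unique(e$id) rather than e$id matters: sampling the raw column would weight ids by how many rows they have and could even draw the same id twice.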
