TukeyHSD returns too many values in R

I'm very new to R (and statistics) and I searched a lot for a possible solution, but couldn't find any.
I have a data set with around 18000 rows and two columns: "rentals" and "season". I want to analyse whether the mean of the rentals differs depending on the season, using a one-way ANOVA.
My data looks like this:
rentals season
     23      1
     12      1
     17      2
     16      2
     44      3
     22      3
      2      4
     14      4
First I calculate the mean and SD of the groups (season):
anova %>%
  group_by(season) %>%
  summarise(
    count_season = n(),
    mean_rentals = mean(rentals, na.rm = TRUE),
    sd_rentals = sd(rentals, na.rm = TRUE))
This is the result:
Then I perform the one-way ANOVA:
anova_one_way <- aov(season~as.factor(rentals), data = anova)
summary(anova_one_way)
(I use as.factor on rentals, because otherwise I'm getting an error with TukeyHSD.)
Result:
Here comes the tricky part. I perform a TukeyHSD test:
TukeyHSD(anova_one_way)
And the results are very disappointing. TukeyHSD returns 376896 rows, while I expected just a few, comparing the seasons with each other. It looks like every single "rentals" value is being treated as its own group. This seems very wrong, but I can't find the cause. Is this common TukeyHSD behaviour given the large data set, or is there an error in my code or logic that causes this enormous, unreadable list of values?
Here is a small excerpt of what it looks like (and it goes on until 376896).

The terms are the wrong way around in your aov() call. Rentals is the outcome (dependent) variable, season is the predictor (independent) variable.
So you want:
anova_one_way <- aov(rentals ~ factor(season), data = anova)
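A minimal sketch of the corrected workflow (assuming, as above, that your data frame is called anova and season has four levels):

anova_one_way <- aov(rentals ~ factor(season), data = anova)
summary(anova_one_way)

# with four seasons, TukeyHSD() returns the choose(4, 2) = 6 pairwise
# season comparisons instead of one row per distinct rentals value
TukeyHSD(anova_one_way)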


Comparing groups with different lengths in a tibble

I'm looking at the effects of drought on plants, and for that I need to compare data from before, during and after the drought. However, it has proven difficult to select those periods from my data, as their length in days varies. Since I have time series of several years with daily resolution, I'd like to avoid selecting the periods manually. I have been struggling with this for quite some time and would be really grateful for any tips and advice.
Here's a simplified example of my data:
myData <- tibble(
  day = c(1:16),
  TWD = c(0, 0, 0, 0.444, 0.234, 0.653, 0, 0, 0.789, 0.734, 0.543, 0.843, 0, 0, 0, 0),
  Amp = c(0.6644333, 0.4990167, 0.3846500, 0.5285000, 0.4525833, 0.4143667,
          0.3193333, 0.5690167, 0.2614667, 0.2646333, 0.7775167, 3.5411667,
          0.4515333, 2.3781333, 2.4140667, 2.6979333)
)
In my data, TWD > 0 means that there is drought, so I identified these periods.
myData %>%
  mutate(status = case_when(TWD > 0 ~ "drought",
                            TWD == 0 ~ "normal")) %>%
  {. ->> myData}
I used the following code to get the length of the individual normal and drought periods
myData$group <- with(myData, rep(seq_along(z<-rle(myData$status)$lengths),z))
with(myData, table(group, status))
     status
group drought normal
    1       0      3
    2       3      0
    3       0      2
    4       4      0
    5       0      4
Here's where I get stuck. Ideally, I would like to have the mean of Amp for each drought period and compare it to the means of the normal periods before and after the drought, then move on to the next drought period. How can I compare the days of e.g. groups 1, 2 and 3? I found a promising solution in Selecting a specific range of days prior to event in R, where map(., function(x) dat[(x-5):(x), ]) was used, but the problem is that I don't have a fixed number of days to compare, as that depends on the length of the normal and drought periods.
I thought of creating a nested tibble to compare the different groups, as in Compare groups with each other, with
tibble(value = myData,
       group = myData$group %>%
         nest(value))
but that creates an error which I believe is because I'm trying to combine a vector and not a tibble.
One possibility would be to use the pairwise Wilcoxon test to compare the means of each group (though, to be honest, I'm not an expert on whether the Wilcoxon is appropriate for this data):
pairwise.wilcox.test(myData$Amp, myData$group, p.adjust.method = 'none', alternative = 'greater')
The column and row indices are the groups, and in this instance you know that the even-numbered groups are the 'drought' periods.
You may need to correct for multiple comparisons (by investigating the p.adjust.method parameter).
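If you also want the period means themselves (for comparing each drought period with the normal periods directly before and after it), a minimal dplyr sketch, assuming myData already has the status and group columns built above:

library(dplyr)

period_means <- myData %>%
  group_by(group, status) %>%
  summarise(mean_Amp = mean(Amp), n_days = n(), .groups = "drop")
period_means
# one row per drought/normal period; a drought group's mean_Amp can be
# compared with the normal groups numbered directly before and after it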

Identical values generated from random samples from a uniform distribution in dplyr

This is a follow-up to a previous question. My question was not fully formulated and therefore not fully answered in my last post. Forgive me, I'm new to using Stack Overflow.
My professor has assigned a problem set, and we are required to use dplyr and other tidyverse packages. I'm well aware that most (if not all) of the tasks I'm trying to execute are possible in base R, but that's not in agreement with my instructions.
First we are asked to generate a tibble of 1000 random samples from a uniform distribution:
2a. Create a new tibble called uniformDf containing a variable called unifSamples that contains 10000 random samples from a uniform distribution. You should use the runif() function to create the uniform samples. {r 2a}
uniformDf <- tibble(unifSamples = runif(1000))
This goes well.
Then we are asked to loop through this tibble 1000 times, each time choosing 20 random samples, computing the mean, and saving it to a tibble:
2c. Now let's loop through 1000 times, sampling 20 values from a uniform distribution and computing the mean of the sample, saving this mean to a variable called sampMean within a tibble called uniformSampleMeans. {r 2c}
unif_sample_size = 20 # sample size
n_samples = 1000 # number of samples

# set up a data frame to contain the results
uniformSampleMeans <- tibble(sampMean = rep(NA, n_samples))

# loop through all samples. for each one, take a new random sample,
# compute the mean, and store it in the data frame
for (i in 1:n_samples) {
  uniformSampleMeans$sampMean[i] <- uniformDf %>%
    sample_n(unif_sample_size) %>%
    summarize(sampMean = mean(sampMean))
}
This all runs, I believe, until I look at my uniformSampleMeans tibble, which looks like this:
1 0.471271611726843
2 0.471271611726843
3 0.471271611726843
4 0.471271611726843
5 0.471271611726843
6 0.471271611726843
7 0.471271611726843
...
1000 0.471271611726843
All the values are identical! Does anyone have any insight as to why my output is like this? I'd be less concerned if they varied by +/- 0.000x values seeing as how this is from a distribution that ranges from 0 to 1 but the values are all identical even out to the 15th decimal place! Any help is much appreciated!
The following selects unif_sample_size random rows and gives their mean:
library(dplyr)
uniformDf %>% sample_n(unif_sample_size) %>% pull(unifSamples) %>% mean
#[1] 0.5563638
If you want to do this n times, use replicate:
n <- 10
replicate(n, uniformDf %>%
  sample_n(unif_sample_size) %>%
  pull(unifSamples) %>%
  mean)
#[1] 0.5070833 0.5259541 0.5617969 0.4695862 0.5030998 0.5745950 0.4688153 0.4914363 0.4449804 0.5202964
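If the assignment requires the means in a tibble called uniformSampleMeans, a minimal sketch building on replicate(); note that the original loop summarised mean(sampMean) rather than the unifSamples column actually being sampled, which is the likely source of the identical values:

library(dplyr)

# n_samples and unif_sample_size as defined in the question
uniformSampleMeans <- tibble(
  sampMean = replicate(n_samples, uniformDf %>%
    sample_n(unif_sample_size) %>%
    pull(unifSamples) %>%
    mean))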

Multinomial logit model in R on grouped data, data conversion and mlogit set-up

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I’m using the “mlogit” package.
The purpose is to model people's choice of transportation mode. However, the dataset is a time series at an aggregated level, e.g.:
This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:
For every individual's choice in the grouped data I make three new rows and use chid to tie these three rows together. I now want to run:
mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode")
Is this the correct approach? Or have I misunderstood the purpose of the chid variable?
It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.
The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the former there is one row per individual indicating the mode chosen, with separate columns for each combination of mode and mode-specific variable (time and price in your example). In the long format there are n rows for every individual, where n is the number of modes, a second column containing TRUE or FALSE indicating which mode was chosen for each individual, and one additional column for each mode-specific variable. Internally, mlogit uses long format datasets, but you can provide wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.
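For illustration, a hypothetical long-format fragment for one individual (the column names here are indicative only, not the exact ones mlogit produces):

#  chid    mode choice price time
#     1     car   TRUE   120    5
#     1     bus  FALSE    60   10
#     1 bicycle  FALSE     0   30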
Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:
# aggregated counts of mode choices per month, and the mode-specific variables
df.agg <- data.frame(month=1:4, car=c(3465,3674,3543,4334), bus=c(1543,2561,2432,1266), bicycle=c(453,234,123,524))
df.lvl <- data.frame(mode=c("car","bus","bicycle"), price=c(120,60,0), time=c(5,10,30))
# expand one month's counts into one row per individual
get.mnth <- function(mnth) data.frame(mode=rep(names(df.agg[2:4]), df.agg[mnth,2:4]), month=mnth)
df <- do.call(rbind, lapply(df.agg$month, get.mnth))
# build the wide-format columns (price.car, time.car, price.bus, ...)
cols <- unlist(lapply(df.lvl$mode, function(x) paste(names(df.lvl)[2:3], x, sep=".")))
cols <- with(df.lvl, setNames(as.vector(apply(df.lvl[2:3], 1, c)), cols))
df <- data.frame(df, as.list(cols))
head(df)
# mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1 car 1 120 5 60 10 0 30
# 2 car 1 120 5 60 10 0 30
# 3 car 1 120 5 60 10 0 30
# 4 car 1 120 5 60 10 0 30
# 5 car 1 120 5 60 10 0 30
# 6 car 1 120 5 60 10 0 30
Now we can use mlogit(...)
library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
# bicycle bus car
# 0.055234 0.323037 0.621729
#
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# price 0.0047375 0.0003936 12.036 < 2.2e-16 ***
# time -0.0740975 0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
# time
# -15.64069
So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
This analysis ignores the month variable. It's not clear to me how you would incorporate that, as month is neither mode-specific nor individual-specific. You could "pretend" that month is individual-specific and use a model formula like mode ~ price+time|month, but with your dataset the system is computationally singular.
To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
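A minimal sketch of that call, reusing the df built above:

fit2 <- mlogit(mode ~ 1 | month, df, shape = "wide",
               varying = 3:8, reflevel = "car")
summary(fit2)  # month effects for bus and bicycle relative to car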
There's a nice tutorial on mlogit here.
Are price and time real variables that you're trying to make a part of the model?
If not, then you don't need to "unaggregate" that data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit but with multinom, it's simple, and I imagine it's possible with mlogit:
# Assuming your original data frame is saved in "df" below
library(nnet)
response <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month
# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)
In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.
Glancing at the "TravelMode" data in the mlogit package and some examples for it though, I do believe the options you've chosen are correct if you really want to go with individual records per person.

For loops, lists-of-lists, and conditional analyses (in R)

I'm trying to compute a reaction time score for every subject in an experiment, but only using a subset of trials, contingent on the subject's performance.
Each subject took a quiz on 16 items. They then took a test on the same 16 items. I'd like to get, for each subject, an average reaction time score but only for those items they got both quiz and test questions correct.
My data file looks something like this:
subject quizitem1 quizitem2 testitem1 testitem2 RT1 RT2
      1         1         0         1         1   5  10
      2         0         1         0         1   3   7
Ideally I'd like another column that represents the average reaction time for each subject when considering only RTs for items i with 1s under both quizitem[i] and testitem[i]. To use the above example, the column would look like this:
newDV
5
7
...since subject 1 only got item 1 correct on both measures, and subject 2 only got item 2 correct on both measures.
I've started by making three vectors, to help keep data from relevant items in the correct order.
quizacclist = c(quizitem1, quizitem2)
testacclist = c(testitem1, testitem2)
RTlist = c(RT1, RT2)
Each of these new vectors is very long, appending the RT1s from all subjects to the RT2s for all subjects, and so forth.
I've tried computing this column using for loops, but can't quite figure out what conditions would be necessary to restrict the analysis to the items meeting the above criteria.
Here is my attempt:
attach(df)
i = 0
j = 0
for(i in subject) {
  for(j in 1:16) {
    denominator[i] = sum(quizacclist[i*j]==1 & testacclist[i*j]==1)
    qualifiedindex[i] = ??
    numerator[i] = sum(RTlist[qualifiedindex])
    meanqualifiedRT[i] = numerator[i]/denominator[i]
  }
}
The denominator variable should be counting the number of items for which a subject has gotten both the quiz and test questions correct.
The numerator variable should be adding up all the RTs for items that contributed to the denominator variable; that is, got quiz and test questions correct for that item.
My specific question at this point is: How do I specify this qualifiedindex? As I conceive of it, it should be a list of lists; each index within the macro list corresponds to a subject, and each subject has a list of their own that pinpoints which items have 1s under both quizacclist[i] and testacclist[i].
For instance:
Qualifiedindex = ([1,5,9],[2,6],[8,16],etc)
Ideally, this structure would allow the numerator variable to only add up RTs that meet the accuracy conditions.
How can this list-within-a-list be created?
Alternatively, is there a better way of achieving my aim?
Any help would be appreciated!
Thanks in advance,
Adam
Here's a solution using base R reshape and then dplyr:
quiz_long <- reshape(quiz, direction = "long",
                     varying = -1, sep = "", idvar = "subject",
                     timevar = "question")

quiz_long %>%
  filter(quizitem == 1 & testitem == 1) %>%
  group_by(subject) %>%
  summarise(mean(RT))
Note this will only include subjects who got at least one usable question. An alternative which will have NA for those subjects:
quiz_long %>%
  mutate(RT = replace(RT, quizitem != 1 | testitem != 1, NA)) %>%
  group_by(subject) %>%
  summarise(mean_RT = mean(RT, na.rm = TRUE))
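For reference, both pipelines assume the question's wide data frame, here named quiz; a minimal construction from the table in the question:

quiz <- data.frame(subject = 1:2,
                   quizitem1 = c(1, 0), quizitem2 = c(0, 1),
                   testitem1 = c(1, 0), testitem2 = c(1, 1),
                   RT1 = c(5, 3), RT2 = c(10, 7))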
Thanks for the promising suggestion, Nick! I've tried it out but am currently stuck on an error from the mutate step, where the replacement has a different number of rows than the data. Is there a common reason why that occurs?
Thanks again,
Adam

R: Shapiro test by group won't produce p-values and corrupt data frame warning

This question has been asked before, but the solutions posed only partially solve my problem, and I've been working on this for days now. I felt it was time to seek help, even if the topic has been addressed previously. I apologize for any inconvenience.
I have a very large data.frame in R with 6288 observations of 11 variables. I want to run a Shapiro test by group on each variable, but grouped by two different factors (Number and Treatment). A much reduced sample data set with one variable is provided for example:
data <- data.frame(Number = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
                   Treatment = c("High","High","High","High","High","High","Low",
                                 "Low","Low","Low","Low","Low","High","High","High",
                                 "High","High","High","Low","Low","Low","Low","Low",
                                 "Low"),
                   FW = c(746,500,498,728,626,580,1462,738,1046,568,320,578,654,664,
                          660,596,1110,834,486,548,688,776,510,788))
I want to run a Shapiro test on FW by Number and by Treatment, so I'd have a test for 1High, 1Low, 2High, 2Low, etc. I'd like to have data for both the W statistic and the P-value. The original dataset contains 16 observations per group (1High,1Low,etc.; total groups=400), and an occasional NA; this sample dataset contains 6 observations per group (1High, 1Low, 2High, 2Low; groups=4).
The following code was previously posted as a solution to this problem of shapiro tests by groups:
res <- aggregate(cbind(P.value = data$FW) ~ data$Number + data$Treatment, data, FUN = shapiro.test)
I've also experimented with a number of other ways of grouping, but nothing seems to work. The above code comes closest.
The code above using aggregate groups my data appropriately and gives me the W statistic, but it won't give me the p-value (the column header says "P.value", but the values are actually the W statistics; I've confirmed this several ways). It also gives me the following warning message:
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
When I did a Google search for this warning, the results suggest it is a bug in the data.frame, but I can't figure out how to solve it. I'm not even sure it really is a bug in this case.
Can anyone help by providing some insight into the warning message, or another way to do the Shapiro test by group?
You're getting that error because shapiro.test returns a list and aggregate expects the result of the aggregation to be a vector or a single number.
aggregate sees the list, takes the first element of the list by default, and tells you why it's unhappy (in admittedly vague terms). But it still gives you the Shapiro-Wilk statistic since that's the first element of the list returned from shapiro.test.
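You can see this by inspecting the return value directly: shapiro.test() returns an htest object, and the two elements you want are statistic and p.value:

res <- shapiro.test(data$FW)
res$statistic  # the W statistic
res$p.value    # the p-value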
You can make a slight modification to your existing code that will get you what you want without issue:
aggregate(formula = FW ~ Number + Treatment,
          data = data,
          FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
# Number Treatment FW.W FW.V2
# 1 1 High 0.88995051 0.31792857
# 2 2 High 0.78604502 0.04385663
# 3 1 Low 0.93305840 0.60391888
# 4 2 Low 0.86456934 0.20540230
Note that the rightmost columns correspond to the statistic and p-value.
This is directly extracting the statistic and p-value from the list, thereby making the result of aggregation a single vector, which makes aggregate happy.
Another option would be to use the data.table package, available from CRAN.
library(data.table)
DT <- data.table(data)
DT[, .(W = shapiro.test(FW)$statistic, P.value = shapiro.test(FW)$p.value),
   by = .(Number, Treatment)]
# Number Treatment W P.value
# 1: 1 High 0.8899505 0.31792857
# 2: 1 Low 0.9330584 0.60391888
# 3: 2 High 0.7860450 0.04385663
# 4: 2 Low 0.8645693 0.20540230
The dplyr package is handy for groupwise operations:
library(dplyr)
data %>%
  group_by(Number, Treatment) %>%
  summarise(statistic = shapiro.test(FW)$statistic,
            p.value = shapiro.test(FW)$p.value)
Number Treatment statistic p.value
1 1 High 0.8899505 0.31792857
2 1 Low 0.9330584 0.60391888
3 2 High 0.7860450 0.04385663
4 2 Low 0.8645693 0.20540230
The simple dplyr answer didn't do it for me, as it did not run the Shapiro test on each group but only once overall, so here's my own solution using nesting (groupvar and quantvar hold the names of your grouping and measurement columns):
library(tidyverse)  # dplyr, purrr, tidyr
library(rstatix)    # provides shapiro_test()

groupvar <- "Treatment"  # your grouping column
quantvar <- "FW"         # your numeric variable

shapiro <- data %>%
  group_by(!!sym(groupvar)) %>%
  group_nest() %>%
  mutate(shapiro = map(.data$data, ~ shapiro_test(.x, !!sym(quantvar)))) %>%
  select(-data) %>%
  unnest(cols = shapiro) %>%
  print()
