I'm using the 'extend' function in simr, but I want to be able to confirm that it has appropriately extended the data set as I wanted it to. Is there a function I can use to show me the data set it has created?
I have a dataset including 17 participants in each of 2 groups. Each participant provided two ratings at each of 8 time points, so that I now have variables of participant (id), the difference between the two ratings (my dependent variable, rating_diff), time (8 levels) and group (2 levels, neutral and threat). As I understand it, id is nested within group.
I constructed the following model and calculated the power to detect an interaction between time and group:
model_es <- lmer(rating_diff ~ time * group + (1 | id), data = data)
fixef(model_es)['time:groupthreat'] <- -0.16
interaction_power0 <- powerSim(model_es, nsim = 100,
                               test = fcompare(rating_diff ~ time + group))
# Power given varies between 86% and 93%, which is too high.
I now want to 'extend' the model to determine the power with only 15 participants in each group. First, I checked the number of rows in my existing dataset:
nrow(getData(model_es)) # gives 252 rows
I worked out that altering the dataset to 15 participants per group should yield 220 rows.
First, I thought I ought to be extending within id+group, but that gives too many rows:
model_es_extend0 <- extend(model_es, within = 'id+group', n=30)
nrow(getData(model_es_extend0)) # 954 rows
I tried extending along id instead:
model_es_extend1 <- extend(model_es, along = 'id', n=30)
nrow(getData(model_es_extend1)) # 220 rows
This clearly gives the correct number of rows, but how can I verify that there are 15 participants per group, rather than 17 still in one group and 13 in the other?
You should be able to check with:
xtabs(~ group + time, data=getData(model_es_extend1))
I suspect the extend command you want is:
model_es_extend2 <- extend(model_es, within = 'time+group', n=15)
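Either way, you can confirm the participant counts per group directly by counting distinct ids; a quick sketch against the extended model objects above (works for model_es_extend1 or model_es_extend2):

# Number of distinct participants in each group of the extended design
with(getData(model_es_extend1), tapply(id, group, function(x) length(unique(x))))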
I have a data frame that looks like this:
df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))
I want to collapse the data frame so that I have one row for A and one for B, with the Values column for those two rows reflecting the most common value within each Set.
I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur once, therefore no "true" mode) it simply takes the first value.
I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.
Create a data frame that defines the hierarchy, and assigns each possibility a numeric score.
hi <- data.frame(Poss = unique(df$Values), Nums = c(105, 104))
In this case, the value 1 gets a numerical score of 105 and the value 2 gets a score of 104 (so 1 would be preferred over 2 in the case of a tie). Note that the hierarchy ranks the candidate Values, not the Sets, since the ties we need to break are between values within a Set.
Join the hierarchy to the original data frame.
library(dplyr)
matched <- left_join(df, hi, by = c("Values" = "Poss"))
Then, add a frequency column to your original data frame that lists the number of times each unique Set-Value combination occurs.
library(data.table)
setDT(matched)[, freq := .N, by = c("Set", "Values")]
Now that those frequencies have been recorded, we only need one row for each Set-Values combination, so get rid of the rest.
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)
Now, multiply frequency by the numeric scores.
multiplied$mult <- multiplied$Nums * multiplied$freq
Lastly, sort by Set (ascending) and then mult (descending), and use distinct() to keep the row with the highest score within each Set.
check <- multiplied[with(multiplied, order(Set, -mult)), ]
final <- distinct(check, Set, .keep_all = TRUE)
This works because multiple instances of the lower-ranked value (numerical score = 104) are added together (3 instances would give it a total score in the mult column of 312), but whenever the two values occur at the same frequency, the higher-ranked value wins out (105 > 104, 210 > 208, etc.).
If using different numeric scores than the ones provided here, make sure they are close enough in relative terms for the dataset at hand. For example, using 2 and 1 doesn't work, because it then takes 3 instances of the lower-ranked value to trump the higher-ranked one, instead of only 2. Likewise, if you anticipate large differences in the frequencies, use 1005 and 1004, since with the scores used above the higher-ranked value eventually overtakes a genuinely more frequent one (199 × 105 is greater than 200 × 104).
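Putting the pieces together, here is a minimal end-to-end sketch of the approach (it simply re-runs the steps above on the example df):

library(dplyr)
library(data.table)

df <- data.frame(Set = c("A","A","A","B","B","B","B"),
                 Values = c(1, 1, 2, 1, 1, 2, 2))
hi <- data.frame(Poss = c(1, 2), Nums = c(105, 104))  # value 1 outranks value 2

matched <- left_join(df, hi, by = c("Values" = "Poss"))
setDT(matched)[, freq := .N, by = c("Set", "Values")]           # frequency of each Set-Values combo
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)  # one row per combo
multiplied$mult <- multiplied$Nums * multiplied$freq            # score = rank weight x frequency
check <- multiplied[with(multiplied, order(Set, -mult)), ]      # best-scoring value first per Set
final <- distinct(check, Set, .keep_all = TRUE)
final
# Expected: A -> 1 (the true mode), B -> 1 (tie between 1 and 2 broken by the hierarchy)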
I'm very new to R (and statistics) and I searched a lot for a possible solution, but couldn't find any.
I have a data set with around 18000 entries and two columns, "rentals" and "season". I want to analyse whether the mean number of rentals differs between seasons, using a one-way ANOVA.
My data looks like this:
rentals  season
     23       1
     12       1
     17       2
     16       2
     44       3
     22       3
      2       4
     14       4
First, I calculate the count, mean and SD of each group (season):
anova %>%
  group_by(season) %>%
  summarise(
    count_season = n(),
    mean_rentals = mean(rentals, na.rm = TRUE),
    sd_rentals = sd(rentals, na.rm = TRUE))
This is the result:
Then I perform the one-way ANOVA:
anova_one_way <- aov(season ~ as.factor(rentals), data = anova)
summary(anova_one_way)
(I use as.factor() on rentals because otherwise I get an error with TukeyHSD.)
Result:
Here comes the tricky part. I perform a TukeyHSD test:
TukeyHSD(anova_one_way)
And the results are very disappointing. TukeyHSD returns 376896 rows, while I expected just a few, comparing the seasons with each other. It looks like every single rentals value is being treated as its own group. This seems very wrong, but I can't find the cause. Is this common TukeyHSD behaviour with a big data set, or is there an error in my code or logic that causes this enormous, unreadable list of values?
Here is a small excerpt of how the output looks (it goes on until row 376896).
The terms are the wrong way around in your aov() call: rentals is the outcome (dependent) variable, and season is the predictor (independent) variable.
So you want:
anova_one_way <- aov(rentals ~ factor(season), data = anova)
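With the formula the right way around, TukeyHSD() compares the four seasons rather than every rentals value; a short sketch (assuming your data frame is named anova, as above):

anova_one_way <- aov(rentals ~ factor(season), data = anova)
summary(anova_one_way)
# With 4 seasons there are choose(4, 2) = 6 pairwise comparisons
TukeyHSD(anova_one_way)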
I'm looking at the effects of drought on plants, and for that I need to compare data from before, during and after each drought. However, it has proven difficult to select those periods from my data, as their lengths in days vary. As I have time series of several years at daily resolution, I'd like to avoid selecting the periods manually. I have been struggling with this for quite some time and would be really grateful for any tips and advice.
Here's a simplified example of my data:
library(tibble)

myData <- tibble(
  day = 1:16,
  TWD = c(0, 0, 0, 0.444, 0.234, 0.653, 0, 0, 0.789, 0.734, 0.543, 0.843, 0, 0, 0, 0),
  Amp = c(0.6644333, 0.4990167, 0.3846500, 0.5285000, 0.4525833, 0.4143667,
          0.3193333, 0.5690167, 0.2614667, 0.2646333, 0.7775167, 3.5411667,
          0.4515333, 2.3781333, 2.4140667, 2.6979333)
)
In my data, TWD > 0 means that there is drought, so I identified these periods.
library(dplyr)

myData <- myData %>%
  mutate(status = case_when(TWD > 0 ~ "drought",
                            TWD == 0 ~ "normal"))
I used the following code to get the lengths of the individual normal and drought periods:
z <- rle(myData$status)$lengths      # run lengths of consecutive drought/normal days
myData$group <- rep(seq_along(z), z) # consecutive period id
with(myData, table(group, status))
status
group drought normal
1 0 3
2 3 0
3 0 2
4 4 0
5 0 4
Here's where I get stuck. Ideally, I would like to take the mean of Amp for each drought period and compare it to the means of the normal periods immediately before and after that drought, then move on to the next drought period. How can I compare the days of, e.g., groups 1, 2 and 3? I found a promising solution here: Selecting a specific range of days prior to event in R, where map(. , function(x) dat[(x-5):(x), ]) was used, but the problem is that I don't have a fixed number of days to compare, as it depends on the lengths of the normal and drought periods.
I thought of creating a nested tibble to compare the different groups, as here: Compare groups with each other, with

tibble(value = myData,
       group = myData$group %>%
         nest(value))

but that throws an error, which I believe is because I'm trying to combine a vector and not a tibble.
One possibility would be to use the pairwise Wilcoxon test to compare the means of each group (though, to be honest, I'm not an expert on whether the Wilcoxon is appropriate for this data):
pairwise.wilcox.test(myData$Amp, myData$group, p.adjust.method = 'none', alternative = 'greater')
The column and row indices are the groups, and in this instance you know that the even-numbered groups are the 'drought' periods.
You may need to correct for multiple comparisons (by investigating the p.adjust.method parameter).
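To get the per-period means of Amp themselves (for comparing each drought period with its neighbouring normal periods), a small dplyr sketch, assuming the group and status columns created above:

library(dplyr)

# Mean Amp and length of each consecutive period, in chronological order
myData %>%
  group_by(group, status) %>%
  summarise(mean_Amp = mean(Amp), n_days = n(), .groups = "drop")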
I have a dataset from which I want to select a random sample of rows, following some predefined rules. This may be a very basic question, but I am very new to this and still trying to grasp the basic concepts. My dataset has some 330 rows (I have included a simplified version here) with several columns. I want to sample 50 of the 330 rows (I kept these numbers in the mock dataset for simplicity, as they are part of the problem), with the option to add the predefined rules to the sampling process.
Here is a mock version of the data:
bank <- data.frame(matrix(0, nrow = 330, ncol = 5))
colnames(bank) <- c("id", "var1", "var2", "year", "lo")
bank$id <- 1:330
bank$var1 <- sample(letters[1:5], 330, replace = TRUE)
bank$var2 <- sample(c("s", "r"), 330, replace = TRUE)
bank$year <- sample(2010:2018, 330, replace = TRUE)
bank$lo <- sample(c("lo1", "lo2", "lo3", "lo4", "lo5", "lo6"), 330, replace = TRUE)
The code I used to try to sample the correct number of rows is
library(splitstackshape)
x <- splitstackshape::stratified(indt = bank, group = c("var1", "var2", "year", "lo"), size = 0.151)
However, this is not selecting 50 rows. I initially tried to set size = 50, but I got the following error:
Groups b s 2012 lo4,... [there is a very long list here],...contain fewer rows than requested. Returning all rows.
Then I tried to define size as a proportion, 0.151 (15.1%), which should give 50 out of 330, but that samples 5 rows (0.5 samples 44 rows, and 0.500000001 samples 287 rows???).
What am I missing? For the moment I am stuck here.
Once I manage to sample the correct number of rows (50), I would like to define some rules, such as: only up to 50% of the sample can have bank$year == 2018, AND only up to half of the bank$year == 2018 rows can have bank$var2 == "r". Obviously I don't expect someone to do this for me, but could you please advise on:
1. Why am I getting the wrong number of rows (probably just syntax)?
2. Which package should I look into if splitstackshape::stratified() is not a good choice for this?
I think the issue comes from the fact that your dataset (as shared here) is fairly small: you have a large number of strata (5 letters × 2 values of var2 × 9 years × 6 lo categories = 540), and it's just not possible to take samples of the desired size from within each stratum. When I bump the dataset up to 33,000 rows and take a sample of 15.1%, I get a sample of size 4,994. Setting size = 50 requests a sample of size 50 from each stratum, which is not remotely possible with the data you've shared.
> bank <- data.frame(matrix(0, nrow = 33000, ncol = 5))
> colnames(bank) <- c("id", "var1", "var2", "year", "lo")
> bank$id <- 1:33000
> bank$var1 <- sample(letters[1:5], 33000, replace = TRUE)
> bank$var2 <- sample(c("s", "r"), 33000, replace = TRUE)
> bank$year <- sample(2010:2018, 33000, replace = TRUE)
> bank$lo <- sample(c("lo1", "lo2", "lo3", "lo4", "lo5", "lo6"), 33000, replace = TRUE)
>
> k <- stratified(bank, group = c('var1', 'var2', 'year', 'lo'), size = .151)
> dim(k)
[1] 4994 5
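To see why per-stratum sampling fails on the small dataset, it can help to inspect the stratum sizes directly; a quick base R sketch, assuming the 330-row bank from the question:

# Cross-tabulate the stratifying variables; most of the 540 strata are empty or tiny
strata_sizes <- table(bank$var1, bank$var2, bank$year, bank$lo)
summary(as.vector(strata_sizes))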
Another approach, provided by Jenny Bryan here, is to specify the desired sample size n for each group directly: nest the data by group, attach the per-group sample sizes, and draw each group's sample with sample_n() (samp holds the randomized sample for each group). The n values need to be adjusted to the proportionate amount per group:
library(dplyr)
library(purrr)
library(tidyr)

bank %>%
  group_by(var1) %>%
  nest() %>%
  mutate(n = c(7, 0, 9, 1, 13),
         samp = map2(data, n, sample_n)) %>%
  select(var1, samp) %>%
  unnest(samp)
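If a simple proportional draw per group is enough, a lighter-weight sketch (assuming dplyr >= 1.0, which provides slice_sample()):

library(dplyr)

# Draw roughly 15.1% of each var1 group, giving about 50 rows in total
bank %>%
  group_by(var1) %>%
  slice_sample(prop = 50 / nrow(bank)) %>%
  ungroup()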
I want to estimate the parameters of a multinomial logit model in R and am wondering how to correctly structure my data. I'm using the mlogit package.
The purpose is to model people's choice of transportation mode. However, the dataset is a time series at an aggregated level, e.g.:
This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:
For every individual's choice in the grouped data I make three new rows and use chid to tie these three rows together. I now want to run:

mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode")

Is this the correct approach? Or have I misunderstood the purpose of chid?
It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.
The mlogit package expects data on individuals and can accept either "wide" or "long" data. In the wide format there is one row per individual, indicating the mode chosen, with separate columns for every combination of the mode-specific variables (time and price in your example). In the long format there are n rows for every individual, where n is the number of modes: a second column contains TRUE or FALSE to indicate which mode was chosen, and there is one additional column for each mode-specific variable. Internally, mlogit uses long-format datasets, but you can provide the wide format and have mlogit transform it for you. In this case, with just two mode-specific variables, that might be the better option.
Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:
df.agg <- data.frame(month = 1:4, car = c(3465, 3674, 3543, 4334),
                     bus = c(1543, 2561, 2432, 1266), bicycle = c(453, 234, 123, 524))
df.lvl <- data.frame(mode = c("car", "bus", "bicycle"), price = c(120, 60, 0), time = c(5, 10, 30))
# One row per individual: repeat each mode name by its count for that month
get.mnth <- function(mnth)
  data.frame(mode = rep(names(df.agg[2:4]), unlist(df.agg[mnth, 2:4])), month = mnth)
df <- do.call(rbind, lapply(df.agg$month, get.mnth))
# Build the wide-format columns price.car, time.car, price.bus, ... (constant within each column)
cols <- unlist(lapply(df.lvl$mode, function(x) paste(names(df.lvl)[2:3], x, sep = ".")))
cols <- setNames(as.vector(apply(df.lvl[2:3], 1, c)), cols)
df <- data.frame(df, as.list(cols))
head(df)
# mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1 car 1 120 5 60 10 0 30
# 2 car 1 120 5 60 10 0 30
# 3 car 1 120 5 60 10 0 30
# 4 car 1 120 5 60 10 0 30
# 5 car 1 120 5 60 10 0 30
# 6 car 1 120 5 60 10 0 30
Now we can use mlogit(...)
library(mlogit)
fit <- mlogit(mode ~ price + time | 0, df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
# bicycle bus car
# 0.055234 0.323037 0.621729
#
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# price 0.0047375 0.0003936 12.036 < 2.2e-16 ***
# time -0.0740975 0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
# time
# -15.64069
So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
This analysis ignores the month variable. It's not clear to me how you would incorporate it, as month is neither mode-specific nor individual-specific. You could "pretend" that month is individual-specific and use a model formula like mode ~ price + time | month, but with your dataset the system is computationally singular.
To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
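For reference, that call might look like the sketch below (reusing the wide-format df built above; the fit2 name is just for illustration):

fit2 <- mlogit(mode ~ 1 | month, df, shape = "wide", varying = 3:8, reflevel = "car")
summary(fit2)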
There's a nice tutorial on mlogit here.
Are price and time real variables that you're trying to include in the model?
If not, then you don't need to "unaggregate" the data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit, but with multinom from the nnet package it's simple, and I imagine it's possible with mlogit too:
# Assuming your original data frame is saved in "df" below
library(nnet)
response <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month
# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)
In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.
Glancing at the "TravelMode" data in the mlogit package and some examples for it, though, I do believe the options you've chosen are correct if you really want to go with individual records per person.
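As a follow-up to the multinom sketch above, the fitted choice probabilities per month can be pulled out with predict() (assuming the fit object from that snippet):

# Estimated probability of each mode for months 1 through 4
predict(fit, newdata = data.frame(predictor = 1:4), type = "probs")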