Extracting index value from a list in R? - r

I want to extract the random effects from my lmer model, including the person this random effect belongs to. My goal is to create a tibble that has one column for the person and another column for the random effect.
Using coef(modelA)$bib I am able to extract the random effect to a list. Here I also see which person the random effect belongs to.
> coef(modelA)$bib
(Intercept)
31 0.37031060
32 0.49877575
33 0.50586345
34 0.52036187
35 0.49813250
However, adding this to a tibble, this information is lost.
> tibble(randEffectModA)
# A tibble: 65 x 1
`(Intercept)`
<dbl>
1 0.370
2 0.499
3 0.506
4 0.520
5 0.498
Is there a simple way to solve this problem?

Those are row names, and tibbles do not support row names.
You have a few options:
Keep the information in a data frame instead of a tibble, so the row names are preserved.
result <- data.frame(coef(modelA)$bib)
Create a separate column from the row names if you want to use tibbles.
randEffectModA <- data.frame(coef(modelA)$bib)
result <- tibble::tibble(person_no = rownames(randEffectModA),
intercept = unlist(randEffectModA))
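As a hedged sketch of the second option: tibble also ships helpers that move row names into a regular column. The data frame below is a hypothetical stand-in for coef(modelA)$bib (the fitted model is not available here), and person_no is simply the column name used above.

```r
library(tibble)

# Hypothetical stand-in for coef(modelA)$bib: the row names carry the person id
rand_eff <- data.frame(
  `(Intercept)` = c(0.37031060, 0.49877575, 0.50586345),
  row.names = c("31", "32", "33"),
  check.names = FALSE
)

# Convert to a tibble and keep the row names as a column in one step
result <- as_tibble(rand_eff, rownames = "person_no")

# Equivalent two-step version: move the row names first, then convert
result2 <- as_tibble(rownames_to_column(rand_eff, var = "person_no"))
```

Note that person_no comes out as a character column; wrap it in as.integer() if you need numeric ids.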

Related

R arrange function seems not to be working

I am having a dataframe with timestamps that also have decimal values. I want to calculate the difference between the first and all other events from the same group. To do that I use the following code:
values <- c("1671535501.862424", "1671535502.060679","1671535502.257422",
"1671535502.472993", "1671535502.652619","1671535502.856569",
"1671535503.048685", "1671535503.245988")
column_b <- c("a", "a","a","a","a","a","a","a")
values<-as.numeric(values)
library(dplyr)
#-- Calculate differences
data <- data.frame(values,column_b) #create data frame
res <- data %>%
group_by(column_b) %>%
arrange(values) %>%
mutate(time=values-lag(values, default = first(values)))
In general, the code does exactly what I expect it to do. It groups them, arranges them, and calculates the difference for each group. The output looks like this:
> res
# A tibble: 8 × 3
# Groups: column_b [2]
values column_b time
<dbl> <fct> <dbl>
1 1671535502. a 0
2 1671535502. a 0.198
3 1671535502. a 0.197
4 1671535502. a 0.216
5 1671535503. a 0.180
6 1671535503. a 0.204
7 1671535503. a 0.192
8 1671535503. a 0.197
Nevertheless, I have my doubts about the math results. If I am not mistaken, the values in this example are prearranged. But even if that was not the case, arrange() should have done the job. Hence, IF it is arranging the values, how can the 4th have a larger value than the 5th? There are multiple examples where we see that it does not make sense. What am I missing?
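One possible reading of the output above, offered as a sketch rather than a confirmed answer: values - lag(values) is the gap between each event and the previous one, so the time column is not expected to grow from row to row. The difference from the first event of the group would be values - first(values). The column names step and since_first below are made up for illustration.

```r
library(dplyr)

values <- as.numeric(c("1671535501.862424", "1671535502.060679", "1671535502.257422",
                       "1671535502.472993", "1671535502.652619", "1671535502.856569",
                       "1671535503.048685", "1671535503.245988"))
data <- data.frame(values, column_b = "a")

res <- data %>%
  group_by(column_b) %>%
  arrange(values, .by_group = TRUE) %>%          # sort within each group
  mutate(
    step        = values - lag(values, default = first(values)),  # gap to the previous event
    since_first = values - first(values)                           # gap to the first event
  ) %>%
  ungroup()
```

The values column only looks identical across rows because tibbles print a limited number of significant digits; setting options(pillar.sigfig = 16) before printing should show the full timestamps.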

How can I calculate a new dataframe only for one outcome type?

I'm working with some data that involves participants running on a cognitive task that measures their outcome (Correct or Incorrect) and reaction time (RT) (the entire dataset is called practice). For each participant, I want to create a new dataframe with their average RT when they got the answer correct, and one for when they were incorrect. I've tried
practice %>%
mutate(correctRT = mean(practice$RT[practice$Outcome=="Correct"]))
Using dplyr and tidyverse, as well as
correctRT <- c(mean(practice$RT[practice$Outcome=="Correct"]))
(which I'm sure isn't the correct way to do it) and nothing seems to be working. I'm a complete novice and am working with this dataset in order to learn how to do stats with R and just can't find any answers with R.
In R you can "keep" multiple objects (e.g. data frames) in a single list. This saves you from storing every (sub)dataframe in a separate variable (e.g. by subsetting your data and storing each piece by Participant and Outcome). This comes in handy when you have "many" individuals and manually filtering and storing each (sub)dataframe becomes impractical.
Conceptually, your problem is to "subset" your data to the Participant and Outcome you aim for and calculate the mean on this group.
The following is based on {tidyverse}, i.e. {dplyr}.
data
As you have not provided a reproducible example, here is a quick mock-up of your data:
practice <- data.frame(
Participant = c("A","A","A","B","B","B","B","C","C","D"),
RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
Outcome = c("Incorrect","Correct", "Correct","Incorrect","Incorrect","Correct", "Correct","Incorrect","Correct", "Correct")
)
which looks like the following:
practice
Participant RT Outcome
1 A 10 Incorrect
2 A 12 Correct
3 A 14 Correct
4 B 9 Incorrect
5 B 12 Incorrect
6 B 13 Correct
7 B 17 Correct
8 C 11 Incorrect
9 C 13 Correct
10 D 17 Correct
splitting groups of a dataframe
The {tidyverse} provides some neat functions for the general data processing.
{dplyr} has a group_split() function that returns such a list.
library(dplyr)
practice %>% group_split(Participant, Outcome)
<list_of<
tbl_df<
Participant: character
RT : double
Outcome : character
>
>[7]>
[[1]]
# A tibble: 2 x 3
Participant RT Outcome
<chr> <dbl> <chr>
1 A 12 Correct
2 A 14 Correct
[[2]]
...
You can address the respective list-elements with the [[]] notation.
Store the list in a variable and try my_list_name[[3]] to extract the 3rd element.
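A minimal sketch of that, using the mock data from above. The names my_list and keys are placeholders; group_keys() is used here only to show which Participant/Outcome combination each list element belongs to.

```r
library(dplyr)

practice <- data.frame(
  Participant = c("A","A","A","B","B","B","B","C","C","D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect","Correct","Correct","Incorrect","Incorrect",
              "Correct","Correct","Incorrect","Correct","Correct")
)

# One tibble per Participant/Outcome combination, stored in a list
my_list <- practice %>% group_split(Participant, Outcome)

# Lookup table: row i describes the group held in my_list[[i]]
keys <- practice %>% group_by(Participant, Outcome) %>% group_keys()

third <- my_list[[3]]    # the 3rd element, here participant B / Correct
mean(third$RT)           # mean reaction time for that element
```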
potential summary for your problem
If you do not need a list you could wrap this into a data summary.
If you want to split on Outcomes, you may want to filter your data in 2 sub-dataframes only holding the respective outcome (e.g. correct <- practice %>% filter(Outcome == "Correct")).
Group your data dependent on the summary you want to construct.
Use summarise() to summarise your groups into a 1-row summary.
Note that you can combine multiple operations. For example, in addition to the mean reaction time, the following counts the number of rows (:= attempts).
practice %>%
group_by(Participant, Outcome) %>%
##--------- summarise each group into a 1-row summary
summarise( Mean_RT = mean(RT) # calculate mean reaction time
,Attempts = n() ) # how many times
This yields:
# A tibble: 7 x 4
# Groups: Participant [4]
Participant Outcome Mean_RT Attempts
<chr> <chr> <dbl> <int>
1 A Correct 13 2
2 A Incorrect 10 1
3 B Correct 15 2
4 B Incorrect 10.5 2
5 C Correct 13 1
6 C Incorrect 11 1
7 D Correct 17 1
Please note that this is a grouped data frame. If you process the data further, you need to "remove" the grouping. Otherwise any follow-up operation in a pipe will work at the group level.
For this you can either use summarise(...., .groups = "drop") or add ... %>% ungroup() to your pipe.
If you need to split the result, see group_split() above.
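Putting the pieces together, a sketch of an ungrouped summary that is then filtered into one data frame per outcome, as the question asked. The object names rt_summary, correct_rt and incorrect_rt are illustrative only.

```r
library(dplyr)

practice <- data.frame(
  Participant = c("A","A","A","B","B","B","B","C","C","D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect","Correct","Correct","Incorrect","Incorrect",
              "Correct","Correct","Incorrect","Correct","Correct")
)

# Summarise per participant and outcome, dropping the grouping afterwards
rt_summary <- practice %>%
  group_by(Participant, Outcome) %>%
  summarise(Mean_RT = mean(RT), Attempts = n(), .groups = "drop")

# One data frame per outcome type
correct_rt   <- rt_summary %>% filter(Outcome == "Correct")
incorrect_rt <- rt_summary %>% filter(Outcome == "Incorrect")
```

Participant D has no incorrect attempts in the mock data, so incorrect_rt has one row fewer than correct_rt.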

How do I make the list output from the 'by' function in R usable?

I have a set of data with a dependent variable and two factors. I would like to randomly sample the dependent variable (with replacement) within each subset of combinations of my two factors (and the number of random samples retrieved should equal the number that existed originally at each combination of the two factors). I've been able to do this using the 'by' function. The problem is the output is a list and I'd like something more accessible but haven't had any luck converting to a data frame. My end goal is to run the simulation described above 1000 times and for each simulation calculate the average of the random samples retrieved for each combination of the factors.
This produces the dataset:
value<-runif(100,5,25)
cat1<-factor(rep(1:10,10))
a<-rep("A",50)
b<-rep("B",50)
cat2<-append(a,b)
data<-as.data.frame(cbind(value,cat1,cat2))
This creates one simulation of random values drawn from the factor levels and
stores that info in a list:
list<-by(data[,"value"],data[,c("cat1","cat2")],function(x) sample(x,length(x),T))
What I'd like to do is wind up with a dataframe that has as columns "Simulation", "AverageValue", "cat1", and "cat2" - so that I would have 1000 simulation lines for each combination of cat1 and cat2.
Any suggestions on how to make the 'by' output more accessible so I can run a for loop on the output or other suggestions would be great.
Thanks!
As a more general method, you might like to use dplyr rather than by. This way you'll keep your data.frame.
In this case, you would use group_by to group by your cat1 and cat2, rather than by, and use mutate to add a new column on. You could replace new = with value = if you don't want to keep your old data:
library(dplyr)
data %>% group_by(cat1, cat2) %>%
mutate(new = sample(value, length(value), replace = T))
Source: local data frame [100 x 4]
Groups: cat1, cat2 [20]
value cat1 cat2 new
(fctr) (fctr) (fctr) (fctr)
1 13.9639607304707 1 A 13.2139691384509
2 22.6068278681487 2 A 5.27278678957373
3 24.6930849226192 3 A 22.0293137291446
4 16.842244095169 4 A 9.56347029190511
5 18.467006101273 5 A 23.1605510273948
6 20.6661582039669 6 A 24.3043746100739
7 9.37060782220215 7 A 13.9268753770739
8 6.68592340312898 8 A 20.034239795059
9 6.95704637560993 9 A 12.676755907014
10 17.2769332909957 10 A 24.453850784339
.. ... ... ...
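To get from a single resample to the 1000 simulations described in the question, one option is to repeat a grouped summarise() and stack the results. This is only a sketch under assumptions: the data are built with data.frame() instead of cbind() so that value stays numeric (cbind() turns everything into text, which is why the columns above print as factors), and the names n_sim and sims are made up.

```r
library(dplyr)

set.seed(1)  # only for reproducibility of this sketch
data <- data.frame(
  value = runif(100, 5, 25),
  cat1  = factor(rep(1:10, 10)),
  cat2  = rep(c("A", "B"), each = 50)
)

n_sim <- 1000

# For each simulation: resample within each cat1/cat2 cell and average.
# Every cell has 5 values here; sample() treats a single number >= 1
# as 1:x, so cells with one observation would need special handling.
sims <- lapply(seq_len(n_sim), function(i) {
  data %>%
    group_by(cat1, cat2) %>%
    summarise(AverageValue = mean(sample(value, length(value), replace = TRUE)),
              .groups = "drop") %>%
    mutate(Simulation = i)
}) %>%
  bind_rows()
```

The result has one row per simulation and factor combination, with the columns cat1, cat2, AverageValue and Simulation.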

Multinomial logit model in R on grouped data, data conversion and mlogit set-up

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I’m using the “mlogit” package.
The purpose is to model people's choice of transportation mode. However, the dataset is a time series on aggregated level, e.g.:
This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:
For every individual's choice in the grouped data I make three new rows and use chid to tie these three rows together. I now want to run:
mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode")
Is this the correct approach? Or have I misunderstood the purpose of the chid function?
It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.
The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the wide format there is one row per individual indicating the mode chosen, with separate columns for every combination of mode and mode-specific variable (time and price in your example). In the long format there are n rows for every individual, where n is the number of modes, a second column containing TRUE or FALSE indicating which mode was chosen for each individual, and one additional column for each mode-specific variable. Internally, mlogit uses long format datasets, but you can provide wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.
Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:
df.agg <- data.frame(month=1:4,car=c(3465,3674,3543,4334),bus=c(1543,2561,2432,1266),bicycle=c(453,234,123,524))
df.lvl <- data.frame(mode=c("car","bus","bicycle"), price=c(120,60,0), time=c(5,10,30))
get.mnth <- function(mnth) data.frame(mode=rep(names(df.agg[2:4]),df.agg[mnth,2:4]),month=mnth)
df <- do.call(rbind,lapply(df.agg$month,get.mnth))
cols <- unlist(lapply(df.lvl$mode,function(x)paste(names(df.lvl)[2:3],x,sep=".")))
cols <- with(df.lvl,setNames(as.vector(apply(df.lvl[2:3],1,c)),cols))
df <- data.frame(df, as.list(cols))
head(df)
# mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1 car 1 120 5 60 10 0 30
# 2 car 1 120 5 60 10 0 30
# 3 car 1 120 5 60 10 0 30
# 4 car 1 120 5 60 10 0 30
# 5 car 1 120 5 60 10 0 30
# 6 car 1 120 5 60 10 0 30
Now we can use mlogit(...)
library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
# bicycle bus car
# 0.055234 0.323037 0.621729
#
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# price 0.0047375 0.0003936 12.036 < 2.2e-16 ***
# time -0.0740975 0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
# time
# -15.64069
So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
This analysis ignores the month variable. It's not clear to me how you would incorporate that, as month is neither mode-specific nor individual specific. You could "pretend" that month is individual-specific, and use a model formula like : mode ~ price+time|month, but with your dataset the system is computationally singular.
To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
There's a nice tutorial on mlogit here.
Are price and time real variables that you're trying to make a part of the model?
If not, then you don't need to "unaggregate" that data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit but with multinom, it's simple, and I imagine it's possible with mlogit:
# Assuming your original data frame is saved in "df" below
library(nnet)
response <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month
# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)
In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.
Glancing at the "TravelMode" data in the mlogit package and some examples for it though, I do believe the options you've chosen are correct if you really want to go with individual records per person.
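A self-contained sketch of the counts-based multinom approach, reusing the monthly counts from the first answer (df.agg, with lower-case column names) in place of the question's original data frame, which is not shown. trace = FALSE only silences the optimiser output.

```r
library(nnet)

# Monthly counts per transportation mode (taken from the first answer)
df.agg <- data.frame(month   = 1:4,
                     car     = c(3465, 3674, 3543, 4334),
                     bus     = c(1543, 2561, 2432, 1266),
                     bicycle = c(453, 234, 123, 524))

# A matrix response is interpreted as counts for each class;
# the first column (car) acts as the reference category
response <- as.matrix(df.agg[, c("car", "bus", "bicycle")])

fit <- multinom(response ~ month, data = df.agg, trace = FALSE)

coef(fit)               # intercept and month effect for bus and bicycle vs. car
probs <- fitted(fit)    # fitted choice probabilities, one row per month
```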

How to rank cells in a data frame in R WITHOUT adding it to the data frame

I've seen a lot of posts about how to add a rank column to the frame, but none on how to just make a variable, ranks, with the data from the ranking procedure. I figured, heck, why not just take the ranking function from inside the transform data.frame function and use that:
transform(df,
year.rank = ave(count, year,
FUN = function(x) rank(-x, ties.method = "first")))
Buuuut that's trying to count occurrences in a year and thus isn't applicable to me. I just want to take the information from the cells in the data frame and rank them. I'm trying to do the Kruskal-Wallis test, but use permutations to find the p-value (which kruskal.test() doesn't do).
I tried to just use rank() on my data frame, but I get this:
Week2_NoAnti Week2_NaN3 Week2_TCS Week2_EDTA <NA> <NA>
1 4 6 10 11 12
<NA> <NA> <NA> <NA> <NA> <NA>
2 3 7 5 8 9
which is less than helpful. The data frame looks like this:
Week2_NoAnti Week2_NaN3 Week2_TCS Week2_EDTA
1 0.0000 0.7665 0.0756 0.1060
2 0.0938 0.9222 0.0806 0.1289
3 0.1243 1.0109 0.1283 0.1882
As previously stated, I'd like to rank the cells. I will also need to later know which column they came from so I can average the ranks that each column got, so I can't just put them all into a vector and rank the vector.
Thanks for the help!
EDIT: Realized a better way to do the data frame might be to have one column with values, and another column with the label. Currently experiencing difficulty making the head() function show more than six results..., but here is what it shows:
Groups agValues
1 Week2_NoAnti 0.0000
2 Week2_NoAnti 0.0938
3 Week2_NoAnti 0.1243
4 Week2_NaN3 0.7665
5 Week2_NaN3 0.9222
6 Week2_NaN3 1.0109
SOLUTION:
Sorry for wasting your time! The above organization made it much easier:
ranks = rank(agValues)
mean(ranks[Groups=="Week2_NoAnti"])
Try
rankmat <- matrix(rank(as.matrix(yourmatrix)), nrow = nrow(yourmatrix))
Here you're flattening your matrix (or data frame, via as.matrix()) into a vector, taking the ranks, and reshaping the resulting vector back into a matrix with the original number of rows.
For the edited data frame that you just posted do this
ranked.df <-df[order(df$agValues),] #decreasing = FALSE by default
#and df is your data.frame
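For the stated goal (average rank per group, plus a permutation p-value for Kruskal-Wallis), a rough sketch on the long-format data could look like this. The values are copied from the data frame in the question; the number of permutations (2000), the small tolerance, and the names obs, perm and p_perm are arbitrary choices for illustration, not an established recipe.

```r
df <- data.frame(
  Groups = rep(c("Week2_NoAnti", "Week2_NaN3", "Week2_TCS", "Week2_EDTA"), each = 3),
  agValues = c(0.0000, 0.0938, 0.1243,
               0.7665, 0.9222, 1.0109,
               0.0756, 0.0806, 0.1283,
               0.1060, 0.1289, 0.1882)
)

# Rank all cells together, then average the ranks within each group
df$ranks <- rank(df$agValues)
mean_ranks <- tapply(df$ranks, df$Groups, mean)

# Permutation p-value: shuffle the group labels and recompute the statistic
set.seed(42)  # only for reproducibility of this sketch
obs  <- unname(kruskal.test(df$agValues, df$Groups)$statistic)
perm <- replicate(2000, unname(kruskal.test(df$agValues, sample(df$Groups))$statistic))
p_perm <- mean(perm >= obs - 1e-12)  # tolerance guards against floating-point ties
```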
