Long-format regression model referring to older measures? - r

Imagine I have an experiment in which I take repeated measures on a group of individuals: at each of four time points I record one x and one y for each individual.
The data could be stored in a wide-format table, with one row for each individual:
ID x1 x2 x3 x4 y1 y2 y3 y4
Or it could also be stored in a more compact long format, like this:
ID T X Y
where T would be 1, 2, 3 or 4. For the moment, keep it simple: the T increments are always the same, 1 unit.
I've been seeing and using this long format to fit regression models with dummy variables.
I usually do it in R (with the syntax "Y ~ X*T", where T is a factor), but it can be done in many other programs in different ways.
In this situation you can find the relationship between each y and its corresponding x (at the same time).
It would be similar to saying:
y1 = a1 + b1·X1
y2 = a2 + b2·X2
y3 = a3 + b3·X3
But you get more power because you analyze all the data together.
Usually I do it with lme4 in order to account for the repeated measures, but forget that for the moment.
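For concreteness, a minimal sketch of that fit (long_df is a hypothetical long table with the ID, T, X, Y columns above):
# long_df: hypothetical long table with columns ID, T, X, Y
long_df$T <- factor(long_df$T)
fit <- lm(Y ~ X * T, data = long_df)
summary(fit)  # the dummy coding yields one intercept and one X slope per time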
My question is: is it possible to use the long format to find relationships such as these?
y1 = a10 + a11·X1
y2 = a20 + a21·X1 + a22·X2
y3 = a30 + a31·X1 + a32·X2 + a33·X3
I mean, every "y" depends not only on the "x" at the same time but also on the previous "x" values (a kind of cumulative effect).
Or am I forced to use a wide format and create new variables instead?
I think the problem with the wide format is that we lose the explicit dependency on T, the time, and I would like to see how the outcome depends on it. I also find it easier to work with the long format.
If you want a very simple reproducible example:
set.seed(1)
ID <- rep(1:4, each = 4)              # four individuals, four rows each
XX <- round(runif(16), 3)             # one X per individual and time
TT <- rep(1:4, 4)                     # time 1..4 within each individual
YY <- ave(XX * TT, ID, FUN = cumsum)  # Y accumulates X*T within each ID
data.frame(ID, TT, XX, YY)
ID TT XX YY
1 1 0.266 0.266
1 2 0.372 1.010
1 3 0.573 2.729
1 4 0.908 6.361
2 1 0.202 0.202
2 2 0.898 1.998
2 3 0.945 4.833
2 4 0.661 7.477
3 1 0.629 0.629
3 2 0.062 0.753
3 3 0.206 1.371
3 4 0.177 2.079
4 1 0.687 0.687
4 2 0.384 1.455
4 3 0.770 3.765
4 4 0.498 5.757
Any solution not relying on R is also welcome.
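In case it helps, one direction I've been toying with (I don't know whether it's the standard approach) is to stay in long format and add lagged copies of X within each ID, letting them interact with T:
library(dplyr)

dat <- data.frame(ID, TT, XX, YY)  # the example data above
dat <- dat %>%
  group_by(ID) %>%
  mutate(X_lag1 = lag(XX, 1, default = 0),
         X_lag2 = lag(XX, 2, default = 0),
         X_lag3 = lag(XX, 3, default = 0)) %>%
  ungroup()

# Interacting current and lagged X with factor(TT) lets each coefficient
# differ by time, mirroring the triangular system of equations above.
# (Lag terms at early times are all 0, so lm() drops them as aliased; with
# only 4 IDs this toy fit is barely estimable and real data needs more.)
fit <- lm(YY ~ (XX + X_lag1 + X_lag2 + X_lag3) * factor(TT), data = dat)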

Related

How to estimate means from same column in large number of dataframes, based upon a grouping variable in R

I have a huge number of DFs in R (>50), which correspond to different filtering I've performed; here's an example of 7 of them:
Steps_Day1 <- filter(PD2, Gait_Day == 1)
Steps_Day2 <- filter(PD2, Gait_Day == 2)
Steps_Day3 <- filter(PD2, Gait_Day == 3)
Steps_Day4 <- filter(PD2, Gait_Day == 4)
Steps_Day5 <- filter(PD2, Gait_Day == 5)
Steps_Day6 <- filter(PD2, Gait_Day == 6)
Steps_Day7 <- filter(PD2, Gait_Day == 7)
Each of the data frames contains 19 variables; however, I'm only interested in speed (to calculate the mean) and SubjectID, as each subject has multiple observations of speed in the same DF.
An example of the data we're interested in, from the data frame Steps_Day1:
Speed SubjectID
0.6 1
0.7 1
0.7 2
0.8 2
0.1 2
1.1 3
1.2 3
1.5 4
1.7 4
0.8 4
The data go up to 61 pts., and each participant's number of observations is much larger than this.
What I want to do is write code that automatically cycles through each of the 50 data frames (taking the 7 above as an example), calculates the mean speed for each participant, and saves it in a new data frame alongside the mean for each participant from the other DFs.
An example of the desired result for Steps_Day1 (values not accurate):
Speed SubjectID
0.6 1
0.7 2
1.2 3
1.7 4
and so on... before I end up with a final DF whose columns contain the means for each participant from each of the other data frames, which may look something like:
Steps_Day1 StepsDay2 StepsDay3 StepsDay4 SubjectID
0.6 0.8 0.5 0.4 1
0.7 0.9 0.6 0.6 2
1.2 1.1 0.4 0.7 3
1.7 1.3 0.3 0.8 4
I could do this with some horrible, messy, long code, but I'm looking to see if anyone has more intuitive ideas please!
:)
To add to the previous answer, I agree that it is much easier to do this without creating a new data frame for each day. Using some generated data, you can achieve your desired results as follows:
library(dplyr)
library(tidyr)

# Generate some data (the length-100 day/subject vectors are recycled to
# length 500 by data.frame())
df <- data.frame(
  day = rep(1:5, length.out = 100),
  subject = rep(5:10, length.out = 100),
  speed = runif(500)
)

df %>%
  group_by(day, subject) %>%
  summarise(avg_speed = mean(speed)) %>%
  pivot_wider(names_from = day,
              names_prefix = "Steps_Day",
              values_from = avg_speed)
# A tibble: 6 × 6
subject Steps_Day1 Steps_Day2 Steps_Day3 Steps_Day4 Steps_Day5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 0.605 0.416 0.502 0.516 0.517
2 6 0.592 0.458 0.625 0.531 0.460
3 7 0.475 0.396 0.586 0.517 0.449
4 8 0.430 0.435 0.489 0.512 0.548
5 9 0.512 0.645 0.509 0.484 0.566
6 10 0.530 0.453 0.545 0.497 0.460
You don't include a MCVE of your dataset, so I can't test out a solution, but it seems like a pretty simple problem to solve with the tidyverse.
First, why do you split PD2 into separate data frames? If you skip that, you can just use group_by and summarize to get the average for the groups:
PD2 %>%
  group_by(Gait_Day, SubjectID) %>%
  summarize(Steps = mean(Speed))
This will give you a "long-form" data.frame with 3 variables: Gait_Day, SubjectID, and Steps, which holds the mean speed for that subject and day. If you want it in the format you show at the end, just pivot into "wide-form" using pivot_wider. You can see this question for further explanation on that: How to reshape data from long to wide format
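For completeness, a hedged sketch of that pivot step (assuming PD2 as described):
library(dplyr)
library(tidyr)

PD2 %>%
  group_by(Gait_Day, SubjectID) %>%
  summarize(Steps = mean(Speed), .groups = "drop") %>%
  pivot_wider(names_from = Gait_Day,
              names_prefix = "Steps_Day",
              values_from = Steps)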

Why am I getting different predicted probabilities on random forest rf$votes vs. predict()?

I ran randomForest on a dataset with a binary outcome and want the predicted probabilities (on the same dataset; I don't need a separate train/test split for this). I was expecting the values for p1 and p2 below to be the same, but clearly they are not. I haven't been able to find a clear description of how they differ. Any help would be appreciated.
library(randomForest)

mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
rf <- randomForest(factor(admit) ~ ., data = mydata)
p1 <- predict(rf, mydata[, 2:4], type = "prob")
p2 <- rf$votes
> head(p1)
0 1
1 0.926 0.074
2 0.584 0.416
3 0.166 0.834
4 0.722 0.278
5 0.968 0.032
6 0.258 0.742
> head(p2)
0 1
1 0.8324324 0.16756757
2 0.7663043 0.23369565
3 0.2447917 0.75520833
4 0.9695431 0.03045685
5 0.9264706 0.07352941
6 0.3351351 0.66486486
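If it's useful: my understanding (hedged, based on the randomForest documentation) is that rf$votes holds out-of-bag (OOB) vote fractions, i.e. each row is classified only by the trees that did not see it during training, whereas predict() with newdata runs every row down every tree, including the trees fit on that row, so p1 is overly optimistic on the training data. A quick check of this reading:
# Omitting newdata asks predict() for the OOB prediction, which should
# line up with rf$votes (p2) rather than with p1:
p3 <- predict(rf, type = "prob")
all.equal(p3, rf$votes, check.attributes = FALSE)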

R: How to run between group t-tests?

I am attempting to run a series of t-tests in R, splitting groups within the same dataset. I have easily been able to group the data using group_by and selecting the necessary variables. I also understand how to run t-tests using the t.test function, but this does not solve the problem of groups.
The data set consists of a group of participants completing an intervention with two different conditions and varying degrees of load (see below for an example).
Participant Condition Load var.1 var.2 var.3
P01 a 1 834.99 0.383 0.342
P01 a 2 917.22 0.342 0.301
P01 a 3 995.24 0.305 0.263
P01 b 1 1074.22 0.276 0.235
P01 b 2 1156.46 0.247 0.208
P01 b 3 871.41 0.307 0.277
P02 a 1 945.10 0.290 0.260
P02 a 2 1010.39 0.272 0.239
P02 a 3 1096.92 0.265 0.234
P02 b 1 1171.91 0.227 0.195
P02 b 2 664.00 0.260 0.191
P02 b 3 711.92 0.238 0.175
P03 a 1 782.02 0.211 0.154
P03 a 2 858.70 0.174 0.134
P03 a 3 915.21 0.154 0.114
P03 b 1 668.22 0.178 0.207
P03 b 2 723.92 0.243 0.186
P03 b 3 788.31 0.209 0.157
I have split groups using:
grouped.my.df <- my.df %>%
  group_by(Condition, Load) %>%
  select(-var.4, -var.5, -var.6)
I have then tried to run t-tests, but I am not sure how to run them on the groups created within the tibble. Is it better to create vectors for each group (if so, how), or can I run t-tests directly on the groups created? (The code below is an example of what I want to do; I know it doesn't actually work.)
t.test(group.P01.a.1$var.1, group.P01.b.1$var1)
Any help is appreciated.
You are not applying group_by correctly. It doesn't really do anything the way you use it right now.
You can select a subset of your data set with filter, e.g.:
grouped.a.1 = my.df %>% filter(Condition == "a", Load == 1)
grouped.b.1 = my.df %>% filter(Condition == "b", Load == 1)
and then use that in the t.test:
t.test(grouped.a.1$var.1, grouped.b.1$var.1)
or, because t.test also accepts a formula argument if there are two groups:
t.test(var.1 ~ Condition, my.df %>% filter(Load == 1))
Both test the a condition against the b condition for Load == 1. I assume that the discrimination by participant in your t.test(group.P01.a.1$var.1, group.P01.b.1$var1) line was unintended.
I think I misunderstood your question, and what you want may be something like
my.df %>%
  select(-Participant) %>%
  group_by(Load) %>%
  summarize_at(
    vars(-group_cols(), -Condition),
    list(p.value = ~ t.test(. ~ Condition)$p.value)
  )
This will give you the p-values of all two-group t-tests between the two conditions for all values of Load and all variables.
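As a side note, summarize_at is superseded in newer dplyr (>= 1.0); a hedged sketch of the same computation with across():
library(dplyr)

my.df %>%
  group_by(Load) %>%
  summarize(across(c(var.1, var.2, var.3),
                   ~ t.test(.x ~ Condition)$p.value,
                   .names = "{.col}_p.value"))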

Why does R return a low p-value for ANOVA on a set of 1s?

I'm trying to use repeated rounds of ANOVA to sort a large dataset into different categories. For each element in the dataset I have twelve data points, which represent three replicates each of four conditions, arising as the combinations of two two-level variables. The data are relative expression compared to a control, which means that for the control itself all twelve values are 1:
> at
v1 v2 values
1. a X 1
2. b X 1
3. a X 1
4. b X 1
5. a X 1
6. b X 1
7. a Y 1
8. b Y 1
9. a Y 1
10. b Y 1
11. a Y 1
12. b Y 1
which I analyze this way (the Tukey wrapper gives me information about whether it is up or down, in addition to whether it is different, which is why I'm using it):
stats <- TukeyHSD(aov(values~v1+v2, data=at))
> stats
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = values ~ v1 + v2, data = at)
$v1
diff lwr upr p adj
a-b 4.440892e-16 -1.359166e-16 1.024095e-15 0.1173068
$v2
diff lwr upr p adj
X-Y -4.440892e-16 -1.024095e-15 1.359166e-16 0.1173068
I expected the p-value to be very close or equal to 1, since the null hypothesis that the two groups in both of these tests are the same is clearly correct. Instead the p-value is quite low, at 0.117! The difference and the bounds are tiny (e-16), so I'm guessing the problem has to do with the numbers being stored internally as slightly off from 1, but I'm not sure how to solve it. Any suggestions?
Thanks a lot!
I'm adding some sample data:
aX1 bX1 aX2 bX2 aX3 bX3 aY1 bY1 aY2 bY2 aY3 bY3
element1 0.112 0 0.172 0.072 0.058 0.055 0 0 0.046 0 0.042 0
element2 0.859 0.294 0.565 0 0.669 0 0.11 0 1.707 0 1.324 0
element3 1.255 0.721 3.645 1.636 5.36 6.701 0 0.097 0.533 0.209 0.358 2.219
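One hedged workaround, assuming the at frame above: snap the stored values back onto a grid so that groups that should be identical really are identical, and skip the test in the degenerate zero-variance case (where the F statistic is 0/0):
at$values <- round(at$values, 6)  # strip the e-16 floating-point noise
if (var(at$values) > 0) {
  stats <- TukeyHSD(aov(values ~ v1 + v2, data = at))
} else {
  message("No variation in 'values'; the ANOVA is degenerate, so skip it.")
}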

counting function on data frame in R

I have the following data frame:
> Mice
Blood States Minute
1 0.875 X0 0.8352569
2 0.875 A2 0.7551901
3 0.625 X0 1.4508139
4 0.625 A1 0.7876343
5 0.375 X0 1.1345252
6 0.125 X0 0.8699363
7 0.375 X0 0.9378742
8 1.125 H1 0.9769522
9 0.625 X0 0.4716321
10 0.875 H1 0.9935999
11 0.625 X0 1.0025917
12 0.375 A1 1.0703999
13 0.375 X0 1.3044854
14 0.875 H1 0.6720436
15 0.875 A1 1.0431863
So every mouse has some value of drug in its "Blood", and its "State" is checked. This is just a piece of my data frame, but the mice can be in 4 different states. "Minute" is whenever something occurs to the mice; it does not matter what.
For every value of "Blood", the mice can be in any of the 4 different states, and I want to count how many observations I have in each category.
The count() function with both columns Blood and States did not work because "States" is a factor column.
To operate on factor levels, you can use tapply or by. If you have a discrete scale for Mice$Blood, convert it to a factor as well:
> by(Mice$States, as.factor(Mice$Blood), function(x) summary(factor(x)))
as.factor(Mice$Blood): 0.125
X0
1
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 0.375
A1 X0
1 3
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 0.625
A1 X0
1 3
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 0.875
A1 A2 H1 X0
1 1 2 1
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 1.125
H1
1
The returned object is a list, so you may capture it and use for your purposes.
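For reference, a plain cross-tabulation gives the same counts more compactly (assuming the Mice frame above); in my experience dplyr's count() also handles factor columns fine:
# Base R: one row per Blood value, one column per State
table(Mice$Blood, Mice$States)

# dplyr equivalent: one row per (Blood, States) combination
library(dplyr)
count(Mice, Blood, States)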
