Preparing data for two way anova in R [duplicate] - r

This question already has answers here:
Extract p-value from aov
(7 answers)
Closed 6 years ago.
I have the following format of the data:
m1 m2 m3 names
1 24.5 28.4 26.1 1
2 23.5 34.2 28.3 2
3 26.4 29.5 24.3 3
4 27.1 32.2 26.2 4
5 29.9 20.1 27.8 5
How can I prepare this data to the format that I can feed to aov in R?
I.e.
values ind name
1 24.5 m1 1
2 23.5 m1 2
3 26.4 m1 3
...
For one way anova I just used stack command. How can I do it for two way anova, without having a loop?

Try this
library(reshape2)
melt(df)

user2100721 has given an answer using a package. Without package imports this can be solved as
a <- read.table(header=TRUE, text="m1 m2 m3 names
24.5 28.4 26.1 1
23.5 34.2 28.3 2
26.4 29.5 24.3 3
27.1 32.2 26.2 4
29.9 20.1 27.8 5")
reshape(a, direction="long", varying=list(c("m1","m2","m3")))

An alternative to all above answers which are nice, you can use gather from tidyr package. Gather takes multiple columns and collapses them into key-value pairs. You need to pass two variables to it, one is the key and the other is value
X<- structure(list(m1 = c(24.5, 23.5, 26.4, 27.1, 29.9), m2 = c(28.4,
34.2, 29.5, 32.2, 20.1), m3 = c(26.1, 28.3, 24.3, 26.2, 27.8),
names = 1:5), .Names = c("m1", "m2", "m3", "names"), class = "data.frame", row.names = c(NA,
-5L))
library(tidyr)
dat <- X %>% gather(variable, value)
> head(dat,10)
# variable value
#1 m1 24.5
#2 m1 23.5
#3 m1 26.4
#4 m1 27.1
#5 m1 29.9
#6 m2 28.4
#7 m2 34.2
#8 m2 29.5
#9 m2 32.2
#10 m2 20.1

Related

Convert a list from data frame (emmGrid class)

I would like to convert a list to dataframe (picture as below)
I did use do.call(rbind.data.frame, contrast), however, I got this Error in xi[[j]] : this S4 class is not subsettable. I still can read them separately. Anyone know about this thing?
This list I got when running the ART anova test by using the package ARTool
Update
This my orignial code to calculate and get the model done.
Organism_df_posthoc <- bird_metrics_long_new %>%
rbind(plant_metrics_long_new) %>%
mutate(Type = factor(Type, levels = c("Forest", "Jungle rubber", "Rubber", "Oil palm"))) %>%
mutate(Category = factor(Category)) %>%
group_by(Category) %>%
mutate_at(c("PD"), ~(scale(.) %>% as.vector())) %>%
ungroup() %>%
nest_by(n1) %>%
mutate(fit = list(art.con(art(PD ~ Category + Type + Category:Type, data = data),
"Category:Type",adjust = "tukey", interaction = T)))
And the output of fit is that I showed already.
With rbind, instead of rbind.data.frame, there is a specific method for 'emmGrid' object and it can directly use the correct method by matching the class if we specify just rbind
do.call(rbind, contrast)
-output
wool tension emmean SE df lower.CL upper.CL
A L 44.6 3.65 48 33.6 55.5
A M 24.0 3.65 48 13.0 35.0
A H 24.6 3.65 48 13.6 35.5
B L 28.2 3.65 48 17.2 39.2
B M 28.8 3.65 48 17.8 39.8
B H 18.8 3.65 48 7.8 29.8
A L 44.6 3.65 48 33.6 55.5
A M 24.0 3.65 48 13.0 35.0
A H 24.6 3.65 48 13.6 35.5
B L 28.2 3.65 48 17.2 39.2
B M 28.8 3.65 48 17.8 39.8
B H 18.8 3.65 48 7.8 29.8
Confidence level used: 0.95
Conf-level adjustment: bonferroni method for 12 estimates
The reason is that there is a specific method for rbind when we load the emmeans
> methods('rbind')
[1] rbind.data.frame rbind.data.table* rbind.emm_list* rbind.emmGrid* rbind.grouped_df* rbind.zoo*
The structure in the example created matches the OP's structure showed
By using rbind.data.frame, it doesn't match because the class is already emmGrid
data
library(multcomp)
library(emmeans)
warp.lm <- lm(breaks ~ wool*tension, data = warpbreaks)
warp.emmGrid <- emmeans(warp.lm, ~ tension | wool)
contrast <- list(warp.emmGrid, warp.emmGrid)
If the OP used 'ARTool' and if the columns are different, the above solution may not work because rbind requires all objects to have the same column names. We could convert to tibble by looping over the list with map (from purrr) and bind them
library(ARTool)
library(purrr)
library(tibble)
map_dfr(contrast, as_tibble)
-output
# A tibble: 42 × 8
contrast estimate SE df t.ratio p.value Moisture_pairwise Fertilizer_pairwise
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 m1 - m2 -23.1 4.12 8.00 -5.61 0.00226 NA NA
2 m1 - m3 -33.8 4.12 8.00 -8.20 0.000169 NA NA
3 m1 - m4 -15.2 4.12 8.00 -3.68 0.0256 NA NA
4 m2 - m3 -10.7 4.12 8 -2.59 0.118 NA NA
5 m2 - m4 7.92 4.12 8 1.92 0.291 NA NA
6 m3 - m4 18.6 4.12 8 4.51 0.00849 NA NA
7 NA 6.83 10.9 24 0.625 0.538 m1 - m2 f1 - f2
8 NA 15.3 10.9 24 1.40 0.174 m1 - m3 f1 - f2
9 NA -5.83 10.9 24 -0.533 0.599 m1 - m4 f1 - f2
10 NA 8.50 10.9 24 0.777 0.445 m2 - m3 f1 - f2
# … with 32 more rows
data
data(Higgins1990Table5, package = "ARTool")
m <- art(DryMatter ~ Moisture*Fertilizer + (1|Tray), data=Higgins1990Table5)
a1 <- art.con(m, ~ Moisture)
a2 <- art.con(m, "Moisture:Fertilizer", interaction = TRUE)
contrast <- list(a1, a2)

How to reshape my data frame to use TTR in R? [duplicate]

This question already has an answer here:
Using Reshape from wide to long in R [closed]
(1 answer)
Closed 2 years ago.
Basically TTR allows to get technical indicator of a ticker and data should be vertical like:
Date Open High Low Close
2014-05-16 16.83 16.84 16.63 16.71
2014-05-19 16.73 16.93 16.66 16.80
2014-05-20 16.80 16.81 16.58 16.70
but my data frame is like:
Sdate Edate Tickers Open_1 Open_2 Open_3 High_1 High_2 High_3 Low_1 Low_2 Low_3 Close_1 Close_2 Close_3
2014-05-16 2014-07-21 TK 31.6 31.8 32.2 32.4 32.4 33.0 31.1 31.5 32.1 32.1 32.1 32.7
2014-05-17 2014-07-22 TGP 25.1 24.8 25.0 25.1 25.3 25.8 24.1 24.4 24.9 24.8 25.0 25.6
2014-05-18 2014-07-23 DNR 3.4 3.5 3.8 3.6 3.8 4.1 3.3 3.5 3.8 3.5 3.7 3.9
As you see I have multiple tickers and time range. I went over package TTR and it does not state how to get technical indicator from which is horizontally made and multiple tickers. My original data has 50days and thousands tickers. To do this, I just knew that, I need to make lists for each tickers, but I'm confused how to do this. How do I achieve this?
You can get data in vertical shape by using pivot_longer :
out <- tidyr::pivot_longer(df, cols = -c(Sdate,Edate, Tickers),
names_to = c('.value', 'num'),
names_sep = '_')
out
# A tibble: 9 x 8
# Sdate Edate Tickers num Open High Low Close
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 2014-05-16 2014-07-21 TK 1 31.6 32.4 31.1 32.1
#2 2014-05-16 2014-07-21 TK 2 31.8 32.4 31.5 32.1
#3 2014-05-16 2014-07-21 TK 3 32.2 33 32.1 32.7
#4 2014-05-17 2014-07-22 TGP 1 25.1 25.1 24.1 24.8
#5 2014-05-17 2014-07-22 TGP 2 24.8 25.3 24.4 25
#6 2014-05-17 2014-07-22 TGP 3 25 25.8 24.9 25.6
#7 2014-05-18 2014-07-23 DNR 1 3.4 3.6 3.3 3.5
#8 2014-05-18 2014-07-23 DNR 2 3.5 3.8 3.5 3.7
#9 2014-05-18 2014-07-23 DNR 3 3.8 4.1 3.8 3.9
If you want to split the above data into list of dataframes based on Ticker you can use split.
split(out, out$Tickers)
data
df <- structure(list(Sdate = c("2014-05-16", "2014-05-17", "2014-05-18"
), Edate = c("2014-07-21", "2014-07-22", "2014-07-23"), Tickers = c("TK",
"TGP", "DNR"), Open_1 = c(31.6, 25.1, 3.4), Open_2 = c(31.8,
24.8, 3.5), Open_3 = c(32.2, 25, 3.8), High_1 = c(32.4, 25.1,
3.6), High_2 = c(32.4, 25.3, 3.8), High_3 = c(33, 25.8, 4.1),
Low_1 = c(31.1, 24.1, 3.3), Low_2 = c(31.5, 24.4, 3.5), Low_3 = c(32.1,
24.9, 3.8), Close_1 = c(32.1, 24.8, 3.5), Close_2 = c(32.1,
25, 3.7), Close_3 = c(32.7, 25.6, 3.9)),
class = "data.frame", row.names = c(NA, -3L))

Gathering multiple data columns currently in factor form

I have a dataset of train carloads. It currently has a number (weekly carload) listed for each company (the row) for each week (the columns) over the course of a couple years (100+ columns). I want to gather this into just two columns: a date and loads.
It currently looks like this:
3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5
I'm looking for:
Date Load
3/29/2017 32.7
3/29/2017 20.5
3/29/2017 24.1
3/29/2017 24.9
4/5/2017 31.6
I've been doing various versions of the following:
rail3 <- rail2 %>%
gather(`3/29/2017`:`1/24/2018`, key = "date", value = "loads")
When I do this it makes a dataset called rail3, but it didn't make the new columns I wanted. It only made the dataset 44 times longer than it was. And it gave me the following message:
Warning message:
attributes are not identical across measure variables;
they will be dropped
I'm assuming this is because the date columns are currently coded as factors. But I'm also not sure how to convert 100+ columns from factors to numeric. I've tried the following and various other methods:
rail2["3/29/2017":"1/24/2018"] <- lapply(rail2["3/29/2017":"1/24/2018"], as.numeric)
None of this has worked. Let me know if you have any advice. Thanks!
If you want to avoid warnings when gathering and want date and numeric output in final df you can do:
library(tidyr)
library(hablar)
# Data from above but with factors
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE) %>%
as_tibble() %>%
convert(fct(everything()))
# Code
rail2 %>%
convert(num(everything())) %>%
gather("date", "load") %>%
convert(dte(date, .args = list(format = "%m/%d/%Y")))
Gives:
# A tibble: 16 x 2
date load
<date> <dbl>
1 2017-03-29 32.7
2 2017-03-29 20.5
3 2017-03-29 24.1
4 2017-03-29 24.9
5 2017-04-05 31.6
Here is a possible solution:
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE)
library(tidyr)
# gather the data from columns and convert to long format.
rail3 <- rail2 %>% gather(key="date", value="load")
rail3
# date load
#1 3/29/2017 32.7
#2 3/29/2017 20.5
#3 3/29/2017 24.1
#4 3/29/2017 24.9
#5 4/5/2017 31.6
#6 4/5/2017 21.8
#7 ...

How to create several new group-based variables most efficiently?

Let's use the following example:
set.seed(2409)
N=5
T=10
id<- rep(LETTERS[1:N],each=T)
time<-rep(1:T, times=N)
var1<-runif(N*T,0,100)
var2<-runif(N*T,0,100)
var3<-runif(N*T,0,100)
var4<-runif(N*T,0,100)
var5<-runif(N*T,0,100)
df<-data.frame(id,time,var1,var2,var3,var4,var5); rm(N,T,id,time,var1,var2,var3,var4,var5)
I now try to execute a function for several of these variables (not the whole series of variables!) and create new variables accordingly.
I already have a suitable code for creating log variables. For this I would use the following code:
cols <- c("var1",
"var3",
"var5")
log <- log(df[cols])
colnames(log) <- paste(colnames(log), "log", sep = "_")
df <- cbind(df,log); rm(log, cols)
This would give me my additional log variables. But now I also want to create lagged and z-transformed variables. These functions refer to the individual IDs. So I wrote the following code that of course works, but is extremely long and inefficient in my real dataset where I apply the function to 38 variables each:
library(Hmisc)
library(dplyr)
df<-df %>%
group_by(id) %>%
mutate(var1_1=Lag(var1, shift=1),
var3_1=Lag(var3, shift=1),
var5_1=Lag(var5, shift=1),
var1_2=Lag(var1, shift=2),
var3_2=Lag(var3, shift=2),
var5_2=Lag(var5, shift=2),
var1_z=scale(var1),
var3_z=scale(var3),
var5_z=scale(var5)
)
I am very sure that there is also a way to make this more efficient. It would be desirable if I could define the original variable once and execute different functions and create new variables as a result.
Thank you very much!
You can use mutate_at with funs. This will apply the three functions in funs to each of the three variables in vars, creating 9 new columns.
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(vars(var1, var3, var5),
funs(lag1 = lag(.), lag2 = lag(., 2), scale))
# # A tibble: 50 x 16
# # Groups: id [5]
# id time var1 var2 var3 var4 var5 var1_lag1 var3_lag1 var5_lag1
# <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 38.8 25.7 29.2 91.1 35.3 NA NA NA
# 2 A 2 87.1 22.3 8.27 31.5 93.7 38.8 29.2 35.3
# 3 A 3 61.7 38.8 0.887 63.0 50.4 87.1 8.27 93.7
# 4 A 4 0.692 60.1 71.5 74.0 41.6 61.7 0.887 50.4
# 5 A 5 60.1 13.3 90.4 80.6 47.5 0.692 71.5 41.6
# 6 A 6 46.4 3.67 36.7 86.9 67.5 60.1 90.4 47.5
# 7 A 7 80.4 72.1 82.2 25.5 70.3 46.4 36.7 67.5
# 8 A 8 48.8 25.7 93.4 19.8 81.2 80.4 82.2 70.3
# 9 A 9 48.2 31.5 82.1 47.2 49.2 48.8 93.4 81.2
# 10 A 10 21.8 32.6 76.5 19.7 41.1 48.2 82.1 49.2
# # ... with 40 more rows, and 6 more variables: var1_lag2 <dbl>, var3_lag2 <dbl>,
# # var5_lag2 <dbl>, var1_scale <dbl>, var3_scale <dbl>, var5_scale <dbl>
Here is an option with data.table
library(data.table)
nm1 <- c('var1', 'var3', 'var5')
nm2 <- paste0(nm1, rep(c('_lag1', '_lag2'), each = 3))
nm3 <- paste0(nm1, '_scale')
setDT(df)[, c(nm2, nm3) := c(shift(.SD, n = 1:2), lapply(.SD,
function(x) as.vector(scale(x)))), by = id, .SDcols = nm1]'

Nesting several groups of columns inside a data frame

The concept of nesting several columns into a single list-column is very powerful. However, I am not sure whether it is possible at all to nest more than one set of columns into several list-columns within the same pipeline using the nest function in {tidyr}. For instance, assume I have the following data frame:
df <- as.data.frame(replicate(6, runif(10) * 100))
colnames(df) <- c(
paste0("a", 1:2), # a1, a2
paste0("b", 1:4) # b1, b2, b3, b4
)
df
a1 a2 b1 b2 b3 b4
1 20.807348 69.339482 91.837151 99.76813 3.394350 33.780049
2 64.667733 20.676381 80.523369 38.42774 85.635208 60.111491
3 55.352501 55.699571 4.812923 38.65333 98.869203 80.345576
4 45.194094 16.511696 83.834651 51.48698 7.191081 16.697210
5 66.401642 89.041055 26.965636 67.90061 90.622428 59.552935
6 35.750100 55.997766 49.768556 68.45900 67.523080 58.993232
7 21.392823 5.335281 56.348328 35.68331 51.029617 66.290035
8 8.851236 19.486580 14.199370 22.49754 14.617592 18.236406
9 70.475652 6.229997 43.169364 12.63378 21.415589 2.163004
10 47.837613 37.641530 38.001288 71.15896 71.000568 2.135611
I would like to nest the "a" columns into a list-column AND nest the "b" columns into a second list-column because I would like to perform different computations on them.
Nesting the "a" columns works:
library(tidyr)
nest(df, a1, a2, .key = "a")
b1 b2 b3 b4 a
1 91.837151 99.76813 3.394350 33.780049 20.80735, 69.33948
2 80.523369 38.42774 85.635208 60.111491 64.66773, 20.67638
3 4.812923 38.65333 98.869203 80.345576 55.35250, 55.69957
4 83.834651 51.48698 7.191081 16.697210 45.19409, 16.51170
5 26.965636 67.90061 90.622428 59.552935 66.40164, 89.04105
6 49.768556 68.45900 67.523080 58.993232 35.75010, 55.99777
7 56.348328 35.68331 51.029617 66.290035 21.392823, 5.335281
8 14.199370 22.49754 14.617592 18.236406 8.851236, 19.486580
9 43.169364 12.63378 21.415589 2.163004 70.475652, 6.229997
10 38.001288 71.15896 71.000568 2.135611 47.83761, 37.64153
But it is impossible to nest the "b" columns AFTER the "a" columns have been nested:
nest(df, a1, a2, .key = "a") %>%
nest(b1, b2, b3, b4, .key = "b")
Error in grouped_df_impl(data, unname(vars), drop) :
Column `a` can't be used as a grouping variable because it's a list
which makes sense by reading the error message.
My work-around is to:
nest the "a" columns
perform the required computations on the "a" list-column
unnest the "a" list-column
nest the "b" columns
perform the required computations on the "b" list-column
unnest the "b" list-column
Is there a more straight-forward way to achieve this? Your help is much appreciated.
We can use map to do this
library(tidyverse)
out <- list('a', 'b') %>%
map(~ df %>%
select(matches(.x)) %>%
nest(names(.), .key = !! rlang::sym(.x))) %>%
bind_cols
out
# A tibble: 1 x 2
# a b
# <list> <list>
#1 <data.frame [10 × 2]> <data.frame [10 × 4]>
out %>%
unnest
# A tibble: 10 x 6
# a1 a2 b1 b2 b3 b4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 20.8 69.3 91.8 99.8 3.39 33.8
# 2 64.7 20.7 80.5 38.4 85.6 60.1
# 3 55.4 55.7 4.81 38.7 98.9 80.3
# 4 45.2 16.5 83.8 51.5 7.19 16.7
# 5 66.4 89.0 27.0 67.9 90.6 59.6
# 6 35.8 56.0 49.8 68.5 67.5 59.0
# 7 21.4 5.34 56.3 35.7 51.0 66.3
# 8 8.85 19.5 14.2 22.5 14.6 18.2
# 9 70.5 6.23 43.2 12.6 21.4 2.16
#10 47.8 37.6 38.0 71.2 71.0 2.14
We could do the separate computations on the 'a' and 'b' list of columns
out %>%
mutate(a = map(a, `*`, 4)) %>%
unnest
# A tibble: 10 x 6
# a1 a2 b1 b2 b3 b4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 83.2 277. 91.8 99.8 3.39 33.8
# 2 259. 82.7 80.5 38.4 85.6 60.1
# 3 221. 223. 4.81 38.7 98.9 80.3
# 4 181. 66.0 83.8 51.5 7.19 16.7
# 5 266. 356. 27.0 67.9 90.6 59.6
# 6 143. 224. 49.8 68.5 67.5 59.0
# 7 85.6 21.3 56.3 35.7 51.0 66.3
# 8 35.4 77.9 14.2 22.5 14.6 18.2
# 9 282. 24.9 43.2 12.6 21.4 2.16
#10 191. 151. 38.0 71.2 71.0 2.14
Having said that, it is also possible to select columns of interest with mutate_at instead of doing nest/unnest
df %>%
mutate_at(vars(matches('^a\\d+')), funs(.*4))

Resources