How to insert file names into a column in R [duplicate] - r

This question already has answers here:
Combine a list of data frames into one data frame by row
(10 answers)
Closed 2 years ago.
Suppose I have these datasets:
rock=structure(list(x1 = c(0, 0.8, 0.4, 0.3, 0.5, 1, 0.7, 0.6, 0.4,
0.4, 0.6), x2 = c(0, 1, 0.5, 0.3, 0.5, 0.5, 0.8, 0.3, 0.6, 0.8,
0.7), x3 = c(0, 0.4, 0.8, 0.4, 0.2, 1, 0.5, 0.8, 0.4, 1, 0.3),
x4 = c(0, 0.3, 0.4, 0.4, 0.5, 0.6, 0.8, 0.3, 0.7, 0.6, 0.2
)), class = "data.frame", row.names = c(NA, -11L))
rave=structure(list(x1 = c(0, 0.8, 0.4, 0.3, 0.5, 1), x2 = c(0, 1,
0.5, 0.3, 0.5, 0.5), x3 = c(0, 0.4, 0.8, 0.4, 0.2, 1), x4 = c(0,
0.3, 0.4, 0.4, 0.5, 0.6)), class = "data.frame", row.names = c(NA,
-6L))
classic=structure(list(x1 = c(0, 0.8), x2 = 0:1, x3 = c(0, 0.4), x4 = c(0,
0.3)), class = "data.frame", row.names = c(NA, -2L))
When I rbind these datasets, how can I add a column holding the name of the dataset each row came from?
The initial data are stored as CSV files named after the datasets, for example
classic=read.csv(path to classic.csv)
The result I want looks like this:
dataset x1 x2 x3 x4
1 classic 0.0 0.0 0.0 0.0
2 classic 0.8 1.0 0.4 0.3
3 Rave 0.0 0.0 0.0 0.0
4 Rave 0.8 1.0 0.4 0.3
5 Rave 0.4 0.5 0.8 0.4
6 Rave 0.3 0.3 0.4 0.4
7 Rave 0.5 0.5 0.2 0.5
8 rock 0.0 0.0 0.0 0.0
9 rock 0.8 1.0 0.4 0.3
10 rock 0.4 0.5 0.8 0.4
11 rock 0.3 0.3 0.4 0.4
12 rock 0.5 0.5 0.2 0.5
13 rock 1.0 0.5 1.0 0.6
14 rock 0.7 0.8 0.5 0.8
15 rock 0.6 0.3 0.8 0.3
16 rock 0.4 0.6 0.4 0.7
17 rock 0.4 0.8 1.0 0.6
18 rock 0.6 0.7 0.3 0.2

Put them in a named list and use bind_rows; lst() names each element after the object, and .id stores those names in a new column:
library(dplyr)
bind_rows(lst(rock, rave, classic), .id = 'dataset')
# dataset x1 x2 x3 x4
#1 rock 0.0 0.0 0.0 0.0
#2 rock 0.8 1.0 0.4 0.3
#3 rock 0.4 0.5 0.8 0.4
#4 rock 0.3 0.3 0.4 0.4
#5 rock 0.5 0.5 0.2 0.5
#6 rock 1.0 0.5 1.0 0.6
#7 rock 0.7 0.8 0.5 0.8
#8 rock 0.6 0.3 0.8 0.3
#9 rock 0.4 0.6 0.4 0.7
#10 rock 0.4 0.8 1.0 0.6
#11 rock 0.6 0.7 0.3 0.2
#12 rave 0.0 0.0 0.0 0.0
#13 rave 0.8 1.0 0.4 0.3
#14 rave 0.4 0.5 0.8 0.4
#15 rave 0.3 0.3 0.4 0.4
#16 rave 0.5 0.5 0.2 0.5
#17 rave 1.0 0.5 1.0 0.6
#18 classic 0.0 0.0 0.0 0.0
#19 classic 0.8 1.0 0.4 0.3
However, it would be better if you could read the data in a list automatically without reading them individually first.
library(dplyr)
library(purrr)
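# Anchor the pattern with $ so only names ending in .csv match
filenames <- list.files('/path/to/csv', pattern = '\\.csv$', full.names = TRUE)
# Read each file and tag its rows with the file name (without extension)
result <- map_df(filenames,
                 ~read.csv(.x) %>%
                   mutate(dataset = tools::file_path_sans_ext(basename(.x))))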
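For readers without purrr, the same idea can be sketched in base R. The folder csv_demo, the helper read_one, and the two tiny CSV files below are made up for illustration; they only stand in for the real files.

```r
# Create a throwaway folder with two small CSV files (illustrative data only)
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(x1 = c(0, 0.8), x2 = c(0, 1)),
          file.path(csv_dir, "classic.csv"), row.names = FALSE)
write.csv(data.frame(x1 = c(0, 0.4, 0.3), x2 = c(0, 0.5, 0.3)),
          file.path(csv_dir, "rave.csv"), row.names = FALSE)

# list.files() returns the paths in alphabetical order
filenames <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: read one file and prepend its name as a column
read_one <- function(path) {
  d <- read.csv(path)
  data.frame(dataset = tools::file_path_sans_ext(basename(path)), d)
}

result <- do.call(rbind, lapply(filenames, read_one))
result
```

Because the file name is attached at read time, no dataset has to be assigned to its own variable first.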

You can add a column holding each dataset's name, then rbind the datasets
and move the last column to the first position:
classic['dataset'] = 'classic'
rave['dataset'] = 'rave'
rock['dataset'] = 'rock'
df <- rbind(classic, rave, rock)
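# Parenthesise ncol(df) - 1: 1:ncol(df)-1 is (1:ncol(df)) - 1 and only works because index 0 is dropped
df <- df[, c(ncol(df), 1:(ncol(df) - 1))]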
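If there are many datasets, repeating the assignment for each one gets tedious. A possible base R sketch is to keep them in a named list and let Map add the name column. The two small data frames here are shortened stand-ins for rock and rave, and the names datasets, tagged and combined are illustrative.

```r
# Shortened stand-ins for the real datasets
rock <- data.frame(x1 = c(0, 0.8), x2 = c(0, 1))
rave <- data.frame(x1 = c(0, 0.4), x2 = c(0, 0.5))

# A named list: the names become the values of the new column
datasets <- list(rock = rock, rave = rave)

# Prepend the list name to each data frame, so the column is already first
tagged <- Map(function(d, nm) data.frame(dataset = nm, d),
              datasets, names(datasets))

combined <- do.call(rbind, tagged)
rownames(combined) <- NULL  # drop the "rock.1", "rock.2", ... row names
combined
```

Since the name column is created first, no column reordering is needed afterwards.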

Related

Group similar strings with numbers and keep order of first appearance

I have a dataframe which looks like this example (just much larger):
var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)
var v1 v2
1 Peter 0.4 0.5
2 Ben 0.6 0.4
3 Mary 0.7 0.2
4 Peter.1 0.3 0.5
5 Ben.1 0.9 0.4
6 Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8 Ben.2 0.6 0.4
9 Mary.2 0.7 0.2
I want to group the strings in 'var' according to the names without the suffixes, and keep the original order of first appearance. Desired output:
var v1 v2
1 Peter 0.4 0.5 # Peter appears first in the original data
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4 # Ben appears second in the original data
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2 # Mary appears third in the original data
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
How can I achieve that?
Thank you!
An option is to create a temporary column without the . and the digits (\\d+) at the end with str_remove, then use factor with levels specified as the unique values or use match to arrange the data
library(dplyr)
library(stringr)
df <- df %>%
mutate(var1 = str_remove(var, "\\.\\d+$")) %>%
arrange(factor(var1, levels = unique(var1))) %>%
select(-var1)
Or use fct_inorder from forcats which will convert to factor with levels in the order of first appearance
library(forcats)
df %>%
arrange(fct_inorder(str_remove(var, "\\.\\d+$")))
-output
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
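The match variant mentioned above can also be written without any packages. This is a minimal base R sketch on the question's data; base_name and grp are illustrative names.

```r
var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)

# Strip a trailing ".<digits>" to get the base name
base_name <- sub("\\.\\d+$", "", df$var)

# Group number = position of the base name among the unique base names,
# i.e. the order of first appearance
grp <- match(base_name, unique(base_name))

# order() leaves ties in their original order, so rows keep their order within a group
result <- df[order(grp), ]
result
```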
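A compact option with sub and data.table::chgroup:
library(data.table)
# "\\.\\d+$" strips the whole numeric suffix; "\\.." would only remove a dot plus one character
df[chgroup(sub("\\.\\d+$", "", df$var)),]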
var v1 v2
1 Peter 0.4 0.5
4 Peter.1 0.3 0.5
7 Peter.2 0.4 0.1
2 Ben 0.6 0.4
5 Ben.1 0.9 0.4
8 Ben.2 0.6 0.4
3 Mary 0.7 0.2
6 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
chgroup groups duplicated values together while retaining the group order (according to the order in which each group first appears), and does so efficiently.
If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this:
df %>%
arrange(var)
var v1 v2
1 Ben 0.6 0.4
2 Ben.1 0.9 0.4
3 Ben.2 0.6 0.4
4 Mary 0.7 0.2
5 Mary.1 0.2 0.2
6 Mary.2 0.7 0.2
7 Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1
Separate the var column into two columns, replace the NAs that get generated with 0, sort, and remove the extra columns.
This works on the numeric value of the numbers rather than the character representation so that, for example, 10 won't come before 2. Also, the match in arrange ensures that the order is based on the first occurrence order.
library(dplyr)
library(tidyr)
df %>%
  separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>%
  mutate(no = replace_na(no, 0)) %>%
  arrange(match(alpha, alpha), no) %>%
  select(-alpha, -no)
giving
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
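To see why sorting on the numeric suffix matters, here is a small base R sketch with a made-up vector that contains a two-digit suffix. The values and the names alpha, suffix and no are for illustration only.

```r
# Made-up names, including a suffix of 10
var <- c("Peter", "Ben", "Peter.10", "Ben.1", "Peter.2")

alpha  <- sub("\\.\\d+$", "", var)       # name without the suffix
suffix <- sub("^[^.]*\\.?", "", var)     # digits after the dot, "" if there is none
suffix[suffix == ""] <- "0"              # treat a missing suffix as 0
no <- as.integer(suffix)

# First key: order of first appearance of the name; second key: numeric suffix
ordered <- var[order(match(alpha, alpha), no)]
ordered
# "Peter" "Peter.2" "Peter.10" "Ben" "Ben.1"
```

Sorting the suffix as text instead would place "10" before "2".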
Update
Have removed what was previously the first solution after reading the update to the question.

Creating a index for unique combination of columns in R

I have a set of data like this:
df = data.frame(A = c(0.1, 0.3, 0.7, 0.9, 0.5, 0.4, 0.3, 0.3, 0.9, 0.9),
B = c(0.5, 0.4, 0.8, 0.6, 0.8, 0.5, 0.4, 0.5, 0.6, 0.5),
D = c(0.2, 0.1, 0.5, 0.8, 0.6, 0.7, 0.1, 0.3, 0.8, 0.3))
But I need to create an index for every unique combination of A, B and D, like this:
index A B D
1 1 0.1 0.5 0.2
2 2 0.3 0.4 0.1
3 3 0.7 0.8 0.5
4 4 0.9 0.6 0.8
5 5 0.5 0.8 0.6
6 6 0.4 0.5 0.7
7 2 0.3 0.4 0.1
8 7 0.3 0.5 0.3
9 4 0.9 0.6 0.8
10 8 0.9 0.5 0.3
Note that the combination between A, B and D is the same for rows 4 and 9 and for rows 2 and 7. Therefore, they receive the same index value
You can use the following code. The index values differ slightly from your output, because cur_group_id() numbers the groups in sorted order rather than in order of first appearance, but the logic is the same:
library(dplyr)
df %>%
group_by(A, B, D) %>%
mutate(index = cur_group_id()) %>%
ungroup() %>%
arrange(index)
# A tibble: 10 x 4
A B D index
<dbl> <dbl> <dbl> <int>
1 0.1 0.5 0.2 1
2 0.3 0.4 0.1 2
3 0.3 0.4 0.1 2
4 0.3 0.5 0.3 3
5 0.4 0.5 0.7 4
6 0.5 0.8 0.6 5
7 0.7 0.8 0.5 6
8 0.9 0.5 0.3 7
9 0.9 0.6 0.8 8
10 0.9 0.6 0.8 8
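We can use match, which numbers the combinations in order of first appearance:
library(dplyr)
library(stringr)
df %>%
  # sep keeps the pasted keys unambiguous
  mutate(index = match(str_c(A, B, D, sep = "_"), unique(str_c(A, B, D, sep = "_")))) %>%
  arrange(index)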
Another dplyr option
df %>%
distinct() %>%
mutate(index = 1:n()) %>%
left_join(x = df)
gives
A B D index
1 0.1 0.5 0.2 1
2 0.3 0.4 0.1 2
3 0.7 0.8 0.5 3
4 0.9 0.6 0.8 4
5 0.5 0.8 0.6 5
6 0.4 0.5 0.7 6
7 0.3 0.4 0.1 2
8 0.3 0.5 0.3 7
9 0.9 0.6 0.8 4
10 0.9 0.5 0.3 8
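For comparison, a package-free sketch of the same first-appearance index: paste the columns into one key per row and look each key up among the unique keys. The variable key and the "_" separator are illustrative choices.

```r
df <- data.frame(A = c(0.1, 0.3, 0.7, 0.9, 0.5, 0.4, 0.3, 0.3, 0.9, 0.9),
                 B = c(0.5, 0.4, 0.8, 0.6, 0.8, 0.5, 0.4, 0.5, 0.6, 0.5),
                 D = c(0.2, 0.1, 0.5, 0.8, 0.6, 0.7, 0.1, 0.3, 0.8, 0.3))

# One string key per row, e.g. "0.1_0.5_0.2"
key <- do.call(paste, c(df, sep = "_"))

# Position of each key among the unique keys = index by first appearance
df$index <- match(key, unique(key))
df
```

Rows 2 and 7 share index 2 and rows 4 and 9 share index 4, as in the desired output.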

R dataframe: loop through multiple columns and row values

I am new to R. I have a large dataframe with millions of rows that looks like below:
Whole code1 P_1 Q_1 code2 P_2 Q_2 code3 P_3 Q_3
64 a 0.2 0.1 b 0.3 0.2 d 0.1 0.9
55 a 0.5 0.3 c 0.1 0.3 b 0.4 0.4
70 b 0.4 0.1 d 0.2 0.5 NULL 0.7 0.7
26 c 0.7 0.5 a 0.2 0.6 b 0.2 0.2
47 a 0.8 0.7 d 0.1 0.2 NULL 0.6 0.8
35 d 0.2 0.8 b 0.8 0.1 a 0.2 0.1
I am looking for three output fields depending on the values in code1, code2, and code3.
> Output1 : If code1 is 'a' or 'b', then Output1 = Whole*P_1, else Output1 = Whole*Q_1
> Output2 : If code1 is 'a' or 'b', then Output2 = Whole*P_2, else Output2 = Whole*Q_2
> Output3 : If code1 is 'a' or 'b', then Output3 = Whole*P_3, else Output3 = Whole*Q_3
I would appreciate it if the code below could be corrected:
df1 %>%
for (i in 1:6) {
if (paste0("code", i) %in% c("a", "b")) {
mutate (paste0("Output", i) = Whole * paste0("P_", i) )
} else {
mutate (paste0("Output", i) = Whole * paste0("Q_", i) )
}
}
library(dplyr)
df1 %>%
mutate(
Output1 = Whole * if_else(code1 %in% c('a', 'b'), P_1, Q_1),
Output2 = Whole * if_else(code1 %in% c('a', 'b'), P_2, Q_2),
Output3 = Whole * if_else(code1 %in% c('a', 'b'), P_3, Q_3)
)
# Whole code1 P_1 Q_1 code2 P_2 Q_2 code3 P_3 Q_3 Output1 Output2 Output3
# 1 64 a 0.2 0.1 b 0.3 0.2 d 0.1 0.9 12.8 19.2 6.4
# 2 55 a 0.5 0.3 c 0.1 0.3 b 0.4 0.4 27.5 5.5 22.0
# 3 70 b 0.4 0.1 d 0.2 0.5 NULL 0.7 0.7 28.0 14.0 49.0
# 4 26 c 0.7 0.5 a 0.2 0.6 b 0.2 0.2 13.0 15.6 5.2
# 5 47 a 0.8 0.7 d 0.1 0.2 NULL 0.6 0.8 37.6 4.7 28.2
# 6 35 d 0.2 0.8 b 0.8 0.1 a 0.2 0.1 28.0 3.5 3.5
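If a loop is preferred, as in the question's attempt, a possible base R sketch builds each column name with paste0 and uses [[ ]] to read and write the columns. It follows the same reading as the block above (code1 decides for every set), and only the columns needed for the calculation are recreated here.

```r
df1 <- data.frame(
  Whole = c(64L, 55L, 70L, 26L, 47L, 35L),
  code1 = c("a", "a", "b", "c", "a", "d"),
  P_1 = c(0.2, 0.5, 0.4, 0.7, 0.8, 0.2), Q_1 = c(0.1, 0.3, 0.1, 0.5, 0.7, 0.8),
  P_2 = c(0.3, 0.1, 0.2, 0.2, 0.1, 0.8), Q_2 = c(0.2, 0.3, 0.5, 0.6, 0.2, 0.1),
  P_3 = c(0.1, 0.4, 0.7, 0.2, 0.6, 0.2), Q_3 = c(0.9, 0.4, 0.7, 0.2, 0.8, 0.1)
)

# TRUE where the P_ column should be used, FALSE where the Q_ column should be used
use_p <- df1$code1 %in% c("a", "b")

for (i in 1:3) {
  p <- df1[[paste0("P_", i)]]
  q <- df1[[paste0("Q_", i)]]
  # ifelse() is vectorised, so the whole column is computed at once
  df1[[paste0("Output", i)]] <- df1$Whole * ifelse(use_p, p, q)
}
df1[, c("Output1", "Output2", "Output3")]
```

The original attempt fails because paste0("code", i) is just a string such as "code1", not the column it names, and a for loop cannot be the target of a pipe.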
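If your data is more generic (not hard-coded, or many more than 3 sets of columns), then we can reshape the data, do the assignment, and shape it back. Note that this version tests each set's own code column (code1 for set 1, code2 for set 2, and so on), so Output_2 and Output_3 differ from the block above, which tests code1 for every set.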
library(tidyr)
df1 %>%
rename_at(vars(starts_with("code")), ~ gsub("(\\D+)", "\\1_", .)) %>%
pivot_longer(
-Whole,
names_to = c(".value", "set"),
names_sep = "_"
) %>%
mutate(Output = Whole * if_else(code %in% c("a", "b"), P, Q)) %>%
pivot_wider(
id_cols = Whole,
names_from = set,
values_from = c(code, P, Q, Output),
names_sep = "_"
)
# # A tibble: 6 x 13
# Whole code_1 code_2 code_3 P_1 P_2 P_3 Q_1 Q_2 Q_3 Output_1 Output_2 Output_3
# <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 64 a b d 0.2 0.3 0.1 0.1 0.2 0.9 12.8 19.2 57.6
# 2 55 a c b 0.5 0.1 0.4 0.3 0.3 0.4 27.5 16.5 22
# 3 70 b d NULL 0.4 0.2 0.7 0.1 0.5 0.7 28 35 49
# 4 26 c a b 0.7 0.2 0.2 0.5 0.6 0.2 13 5.2 5.2
# 5 47 a d NULL 0.8 0.1 0.6 0.7 0.2 0.8 37.6 9.4 37.6
# 6 35 d b a 0.2 0.8 0.2 0.8 0.1 0.1 28 28 7
As a side note, in general I would recommend keeping it in the "long" format and not re-widening it. This "long" format is often preferred for other tidy-like functions (including ggplot2), and is easily extended to arbitrary counts. That would result in data like this:
df1 %>%
rename_at(vars(starts_with("code")), ~ gsub("(\\D+)", "\\1_", .)) %>%
pivot_longer(
-Whole,
names_to = c(".value", "set"),
names_sep = "_"
) %>%
mutate(Output = Whole * if_else(code %in% c("a", "b"), P, Q))
# # A tibble: 18 x 6
# Whole set code P Q Output
# <int> <chr> <chr> <dbl> <dbl> <dbl>
# 1 64 1 a 0.2 0.1 12.8
# 2 64 2 b 0.3 0.2 19.2
# 3 64 3 d 0.1 0.9 57.6
# 4 55 1 a 0.5 0.3 27.5
# 5 55 2 c 0.1 0.3 16.5
# 6 55 3 b 0.4 0.4 22
# 7 70 1 b 0.4 0.1 28
# 8 70 2 d 0.2 0.5 35
# 9 70 3 NULL 0.7 0.7 49
# 10 26 1 c 0.7 0.5 13
# 11 26 2 a 0.2 0.6 5.2
# 12 26 3 b 0.2 0.2 5.2
# 13 47 1 a 0.8 0.7 37.6
# 14 47 2 d 0.1 0.2 9.4
# 15 47 3 NULL 0.6 0.8 37.6
# 16 35 1 d 0.2 0.8 28
# 17 35 2 b 0.8 0.1 28
# 18 35 3 a 0.2 0.1 7
(Much shorter.)
We can use map2. Get the names of the columns that start with 'P_' or 'Q_' followed by digits, then loop over the corresponding pairs of columns with map2, apply the transformation logic, and bind the resulting columns to the original dataset:
library(dplyr)
library(purrr)
library(stringr)
ps <- names(df1)[str_detect(names(df1), "^P_\\d+$")]
qs <- names(df1)[str_detect(names(df1), "^Q_\\d+$")]
map2_dfc(ps, qs, ~ df1 %>%
transmute(Output = Whole *
case_when(code1 %in% c('a', 'b') ~ !! rlang::sym(.x),
TRUE ~ !! rlang::sym(.y)))) %>%
rename_all(~ str_remove(., fixed("..."))) %>%
bind_cols(df1, .)
# Whole code1 P_1 Q_1 code2 P_2 Q_2 code3 P_3 Q_3 Output1 Output2 Output3
#1 64 a 0.2 0.1 b 0.3 0.2 d 0.1 0.9 12.8 19.2 6.4
#2 55 a 0.5 0.3 c 0.1 0.3 b 0.4 0.4 27.5 5.5 22.0
#3 70 b 0.4 0.1 d 0.2 0.5 NULL 0.7 0.7 28.0 14.0 49.0
#4 26 c 0.7 0.5 a 0.2 0.6 b 0.2 0.2 13.0 15.6 5.2
#5 47 a 0.8 0.7 d 0.1 0.2 NULL 0.6 0.8 37.6 4.7 28.2
#6 35 d 0.2 0.8 b 0.8 0.1 a 0.2 0.1 28.0 3.5 3.5
data
df1 <- structure(list(Whole = c(64L, 55L, 70L, 26L, 47L, 35L), code1 = c("a",
"a", "b", "c", "a", "d"), P_1 = c(0.2, 0.5, 0.4, 0.7, 0.8, 0.2
), Q_1 = c(0.1, 0.3, 0.1, 0.5, 0.7, 0.8), code2 = c("b", "c",
"d", "a", "d", "b"), P_2 = c(0.3, 0.1, 0.2, 0.2, 0.1, 0.8), Q_2 = c(0.2,
0.3, 0.5, 0.6, 0.2, 0.1), code3 = c("d", "b", "NULL", "b", "NULL",
"a"), P_3 = c(0.1, 0.4, 0.7, 0.2, 0.6, 0.2), Q_3 = c(0.9, 0.4,
0.7, 0.2, 0.8, 0.1)), class = "data.frame", row.names = c(NA,
-6L))
Depending on how many rows you have, this data.table approach may be faster.
library(data.table)
setDT(df1)[,Logical := (code1 == "a" | code1 == "b")][
,`:=`(Output1 = numeric(),Output2 = numeric(), Output3 = numeric())
][Logical == TRUE,`:=`(Output1 = Whole * P_1,
Output2 = Whole * P_2,
Output3 = Whole * P_3)
][Logical == FALSE,`:=`(Output1 = Whole * Q_1,
Output2 = Whole * Q_2,
Output3 = Whole * Q_3)
][,.(Output1,Output2,Output3)]
Output1 Output2 Output3
1: 12.8 19.2 6.4
2: 27.5 5.5 22.0
3: 28.0 14.0 49.0
4: 13.0 15.6 5.2
5: 37.6 4.7 28.2
6: 28.0 3.5 3.5

Reshape from wide to long in R where id and value of id are in the same row

I am having trouble reshaping my data set into a panel data set. My df looks as follows:
id s1 s2 s3 s4 ct1 ct2 ret1 ret2 ret3 ret4
1 a b c d 0.5 0.5 0.6 0.7 0.8 0.5
2 c b a d 0.6 0.6 0.7 0.6 0.5 0.4
3 a c d b 0.7 0.7 0.7 0.8 0.2 0.1
I would like to reshape so it looks as follows
id s ct1 ct2 ret
1 a 0.5 0.5 0.6
1 b 0.5 0.5 0.7
1 c 0.5 0.5 0.8
1 d 0.5 0.5 0.5
2 a 0.6 0.6 0.5
2 b 0.6 0.6 0.6
2 c 0.6 0.6 0.7
2 d 0.6 0.6 0.4
3 a 0.7 0.7 0.7
3 b 0.7 0.7 0.1
3 c 0.7 0.7 0.8
3 d 0.7 0.7 0.2
I regularly reshape from wide to long, but somehow I cannot get my head around this problem.
1) base R
An option using reshape
out <- reshape(
dat,
idvar = c("id", "ct1", "ct2"),
varying = c(outer(c("s", "ret"), 1:4, paste0)),
sep = "",
direction = "long"
)
Remove rownames and column time
rownames(out) <- out$time <- NULL
Result
out[order(out$id), ]
# id ct1 ct2 s ret
#1 1 0.5 0.5 a 0.6
#4 1 0.5 0.5 b 0.7
#7 1 0.5 0.5 c 0.8
#10 1 0.5 0.5 d 0.5
#2 2 0.6 0.6 c 0.7
#5 2 0.6 0.6 b 0.6
#8 2 0.6 0.6 a 0.5
#11 2 0.6 0.6 d 0.4
#3 3 0.7 0.7 a 0.7
#6 3 0.7 0.7 c 0.8
#9 3 0.7 0.7 d 0.2
#12 3 0.7 0.7 b 0.1
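If the reshape arguments feel opaque, the same result can be sketched more manually: build one small data frame per column pair (s1/ret1, s2/ret2, ...) and stack them. The data are retyped here as plain character columns, and the name long is illustrative.

```r
dat <- data.frame(
  id = 1:3,
  s1 = c("a", "c", "a"), s2 = c("b", "b", "c"),
  s3 = c("c", "a", "d"), s4 = c("d", "d", "b"),
  ct1 = c(0.5, 0.6, 0.7), ct2 = c(0.5, 0.6, 0.7),
  ret1 = c(0.6, 0.7, 0.7), ret2 = c(0.7, 0.6, 0.8),
  ret3 = c(0.8, 0.5, 0.2), ret4 = c(0.5, 0.4, 0.1),
  stringsAsFactors = FALSE
)

# One block per pair (s_i, ret_i); ct1 and ct2 are repeated in every block
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(id = dat$id,
             s = dat[[paste0("s", i)]],
             ct1 = dat$ct1, ct2 = dat$ct2,
             ret = dat[[paste0("ret", i)]],
             stringsAsFactors = FALSE)
}))

# Sort by id and s to match the desired layout
long <- long[order(long$id, long$s), ]
rownames(long) <- NULL
long
```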
2) data.table
Using melt from data.table
library(data.table)
out <- melt(
setDT(dat),
id.vars = c("id", "ct1", "ct2"),
measure.vars = patterns(c("^s\\d", "^ret\\d")),
value.name = c("s", "ret")
)[, variable := NULL]
out
data
dat <- structure(list(id = 1:3, s1 = structure(c(1L, 2L, 1L), .Label = c("a",
"c"), class = "factor"), s2 = structure(c(1L, 1L, 2L), .Label = c("b",
"c"), class = "factor"), s3 = structure(c(2L, 1L, 3L), .Label = c("a",
"c", "d"), class = "factor"), s4 = structure(c(2L, 2L, 1L), .Label = c("b",
"d"), class = "factor"), ct1 = c(0.5, 0.6, 0.7), ct2 = c(0.5,
0.6, 0.7), ret1 = c(0.6, 0.7, 0.7), ret2 = c(0.7, 0.6, 0.8),
ret3 = c(0.8, 0.5, 0.2), ret4 = c(0.5, 0.4, 0.1)), .Names = c("id",
"s1", "s2", "s3", "s4", "ct1", "ct2", "ret1", "ret2", "ret3",
"ret4"), class = "data.frame", row.names = c(NA, -3L))
You could do it using gather and spread from the tidyr package. You will need to create a temporary id variable in order to be able to pivot the data:
library(dplyr)
library(tidyr)
library(stringr)  # for str_extract
df %>%
  gather(key, value, -id, -ct1, -ct2) %>%
  mutate(key = str_extract(key, "[:alpha:]+")) %>%  # s1..s4 -> s, ret1..ret4 -> ret
  group_by(key) %>%
  mutate(tmp_id = row_number()) %>%
  ungroup() %>%
  spread(key, value, convert = TRUE) %>%  # convert turns ret back into numbers
  select(id, s, ct1, ct2, ret)
Here is one way that tidyr (1.0.0 or later; when this was written it required the development version, installed with devtools::install_github("tidyverse/tidyr")) can make this a lot easier with its pivot_longer tools. We make a spec indicating that the s columns should go into an s variable and similarly for the ret columns. You can remove the final obs column that indicates the number after s or ret if desired.
library(tidyverse)
tbl <- read_table2(
"id s1 s2 s3 s4 ct1 ct2 ret1 ret2 ret3 ret4
1 a b c d 0.5 0.5 0.6 0.7 0.8 0.5
2 c b a d 0.6 0.6 0.7 0.6 0.5 0.4
3 a c d b 0.7 0.7 0.7 0.8 0.2 0.1"
)
spec <- tibble(
`.name` = tbl %>% select(matches("^s|ret")) %>% colnames(),
`.value` = str_remove(`.name`, "\\d$"),
obs = str_extract(`.name`, "\\d")
)
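# Released tidyr takes a spec via pivot_longer_spec(); the development
# version used originally accepted pivot_longer(spec = spec)
tbl %>%
  pivot_longer_spec(spec)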
#> # A tibble: 12 x 6
#> id ct1 ct2 obs s ret
#> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1 0.5 0.5 1 a 0.6
#> 2 1 0.5 0.5 2 b 0.7
#> 3 1 0.5 0.5 3 c 0.8
#> 4 1 0.5 0.5 4 d 0.5
#> 5 2 0.6 0.6 1 c 0.7
#> 6 2 0.6 0.6 2 b 0.6
#> 7 2 0.6 0.6 3 a 0.5
#> 8 2 0.6 0.6 4 d 0.4
#> 9 3 0.7 0.7 1 a 0.7
#> 10 3 0.7 0.7 2 c 0.8
#> 11 3 0.7 0.7 3 d 0.2
#> 12 3 0.7 0.7 4 b 0.1
Created on 2019-07-23 by the reprex package (v0.3.0)

How to plot average of multiple columns by factor variables

I am trying to plot what is essentially calculated average time-series data for a dependent variable with 2 independent variables. DV = pupil dilation (at multiple time points "T") in response to doing a motor task (IV_A) in combination with 3 different speech-in-noise signals (IV_B).
I would like to plot the average dilation across subjects at each time point (mean for each T column) , with separate lines for each condition.
So, the x axis would be T1 to T5 with a separate line for IV_A(=1):IV_B(=1),IV_A(=1):IV_B(=2),and IV_A(=1):IV_B(=3)
Depending how it looks, I might want the IV_A(=2) lines on a separate plot. But all in one graph would make for an easy visual comparison.
I'm wondering if I need to melt the data to make it extremely long (there are about 110 T columns), or if there is a way to accomplish what I want without restructuring the data frame.
The data look something like this:
Subject IV_A IV_B T1 T2 T3 T4 T5
1 1 1 0.2 0.3 0.5 0.6 0.3
1 1 2 0.3 0.2 0.3 0.4 0.4
1 1 3 0.2 0.4 0.5 0.2 0.3
1 2 1 0.3 0.2 0.3 0.4 0.4
1 2 2 0.2 0.3 0.5 0.6 0.3
1 2 3 0.2 0.4 0.5 0.2 0.3
2 1 1 0.2 0.3 0.5 0.6 0.3
2 1 2 0.3 0.2 0.3 0.4 0.4
2 1 3 0.2 0.4 0.5 0.2 0.3
2 2 1 0.3 0.2 0.3 0.4 0.4
2 2 2 0.2 0.3 0.5 0.6 0.3
2 2 3 0.2 0.4 0.5 0.2 0.3
3 1 1 0.2 0.3 0.5 0.6 0.3
3 1 2 0.3 0.2 0.3 0.4 0.4
3 1 3 0.2 0.4 0.5 0.2 0.3
3 2 1 0.3 0.2 0.3 0.4 0.4
3 2 2 0.2 0.3 0.5 0.6 0.3
3 2 3 0.2 0.4 0.5 0.2 0.3
Edit:
Unfortunately, I can't adapt @eipi10's code to my actual data frame, which looks as follows:
Subject Trk_Y.N NsCond X.3 X.2 X.1 X0 X1 X2 X3
1 N Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
Trk_Y.N indicates whether the block was run with or without a secondary motor tracking task ("Yes" or "No"). NsCond is the type of noise the speech stimuli are presented in.
It's likely better to replace "Y" with "Tracking" and "N" with "No_Tracking".
I tried:
test_data[test_data$Trk_Y.N == "Y",]$Trk_Y.N = "Tracking"
But got a warning:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = c("Tracking", "Tracking", :
invalid factor level, NA generated
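The warning arises because Trk_Y.N is a factor and "Tracking" is not one of its levels. One possible way around it, sketched here on a made-up factor trk rather than the real test_data, is to rename the levels instead of assigning new values:

```r
# Made-up stand-in for test_data$Trk_Y.N
trk <- factor(c("N", "N", "Y", "Y"))

# Renaming a level changes every element that carries it
levels(trk)[levels(trk) == "Y"] <- "Tracking"
levels(trk)[levels(trk) == "N"] <- "No_Tracking"
trk
```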
I may not have understood your data structure, so please let me know if this isn't what you had in mind:
library(reshape2)
library(ggplot2)
library(dplyr)
"Melt" data to long format. This will give us one observation for each Subject, IV and Time:
# Convert the two `IV` columns into a single column
df.m = df %>% mutate(IV = paste0("A",IV_A,":","B",IV_B)) %>% select(-IV_A,-IV_B)
# Melt to long format
df.m = melt(df.m, id.var=c("Subject","IV"), variable.name="Time", value.name="Pupil_Dilation")
head(df.m)
Subject IV Time Pupil_Dilation
1 1 A1:B1 T1 0.2
2 1 A1:B2 T1 0.3
3 1 A1:B3 T1 0.2
4 1 A2:B1 T1 0.3
5 1 A2:B2 T1 0.2
6 1 A2:B3 T1 0.2
Now we can plot a line giving the average value of Pupil_Dilation for each Time point for each level of IV, plus 95% confidence intervals. In your sample data, there's only a single measurement at each Time for each level of IV so no 95% confidence interval is included in the example graph below. However, if you have multiple measurements in your actual data, then you can use the code below to include the confidence interval:
pd=position_dodge(0.5)
# mean_cl_boot needs the Hmisc package; fun replaces the deprecated fun.y (ggplot2 3.3.0+)
ggplot(df.m, aes(Time, Pupil_Dilation, colour=IV, group=IV)) +
  stat_summary(fun.data=mean_cl_boot, geom="errorbar", width=0.1, position=pd) +
  stat_summary(fun=mean, geom="line", position=pd) +
  stat_summary(fun=mean, geom="point", position=pd) +
  scale_y_continuous(limits=c(0, max(df.m$Pupil_Dilation))) +
  theme_bw()
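To check the numbers behind such a plot, the per-condition means at each time point can be computed directly. This is a base R sketch on a tiny made-up data frame (two subjects, two conditions, two time points); the values and the names time_cols, long and means are illustrative.

```r
# Tiny made-up example in the same wide layout as the question
df <- data.frame(Subject = rep(1:2, each = 2),
                 IV_A = 1,
                 IV_B = rep(1:2, times = 2),
                 T1 = c(0.2, 0.3, 0.4, 0.5),
                 T2 = c(0.3, 0.2, 0.5, 0.4))

time_cols <- grep("^T\\d+$", names(df), value = TRUE)

# Long format: one row per Subject, condition and time point
long <- data.frame(
  IV = rep(paste0("A", df$IV_A, ":B", df$IV_B), times = length(time_cols)),
  Time = rep(time_cols, each = nrow(df)),
  Pupil_Dilation = unlist(df[time_cols], use.names = FALSE)
)

# Mean across subjects for each condition at each time point
means <- aggregate(Pupil_Dilation ~ IV + Time, data = long, FUN = mean)
means
```

These are the values the line and point layers draw, so the long format is what makes one line per condition straightforward.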
