How to plot average of multiple columns by factor variables - r

I am trying to plot what is essentially calculated average time-series data for a dependent variable with 2 independent variables. DV = pupil dilation (at multiple time points "T") in response to doing a motor task (IV_A) in combination with 3 different speech-in-noise signals (IV_B).
I would like to plot the average dilation across subjects at each time point (the mean of each T column), with separate lines for each condition.
So, the x axis would be T1 to T5, with a separate line for IV_A(=1):IV_B(=1), IV_A(=1):IV_B(=2), and IV_A(=1):IV_B(=3).
Depending how it looks, I might want the IV_A(=2) lines on a separate plot. But all in one graph would make for an easy visual comparison.
I'm wondering if I need to melt the data to make it extremely long (there are about 110 T columns), or if there is a way to accomplish what I want without restructuring the data frame.
The data look something like this:
Subject IV_A IV_B T1 T2 T3 T4 T5
1 1 1 0.2 0.3 0.5 0.6 0.3
1 1 2 0.3 0.2 0.3 0.4 0.4
1 1 3 0.2 0.4 0.5 0.2 0.3
1 2 1 0.3 0.2 0.3 0.4 0.4
1 2 2 0.2 0.3 0.5 0.6 0.3
1 2 3 0.2 0.4 0.5 0.2 0.3
2 1 1 0.2 0.3 0.5 0.6 0.3
2 1 2 0.3 0.2 0.3 0.4 0.4
2 1 3 0.2 0.4 0.5 0.2 0.3
2 2 1 0.3 0.2 0.3 0.4 0.4
2 2 2 0.2 0.3 0.5 0.6 0.3
2 2 3 0.2 0.4 0.5 0.2 0.3
3 1 1 0.2 0.3 0.5 0.6 0.3
3 1 2 0.3 0.2 0.3 0.4 0.4
3 1 3 0.2 0.4 0.5 0.2 0.3
3 2 1 0.3 0.2 0.3 0.4 0.4
3 2 2 0.2 0.3 0.5 0.6 0.3
3 2 3 0.2 0.4 0.5 0.2 0.3
Edit:
Unfortunately, I can't adapt @eipi10's code to my actual data frame, which looks as follows:
Subject Trk_Y.N NsCond X.3 X.2 X.1 X0 X1 X2 X3
1 N Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
Trk_Y.N indicates whether the block was run with or without a secondary motor tracking task ("Yes" or "No"). NsCond is the type of noise the speech stimuli are presented in.
It's likely better to replace "Y" with "Tracking" and "N" with "No_Tracking".
I tried:
test_data[test_data$Trk_Y.N == "Y",]$Trk_Y.N = "Tracking"
But got a warning, and the replaced values became NA:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = c("Tracking", "Tracking", :
invalid factor level, NA generated

I may not have understood your data structure, so please let me know if this isn't what you had in mind:
library(reshape2)
library(ggplot2)
library(dplyr)
"Melt" data to long format. This will give us one observation for each Subject, IV and Time:
# Convert the two `IV` columns into a single column
df.m = df %>% mutate(IV = paste0("A",IV_A,":","B",IV_B)) %>% select(-IV_A,-IV_B)
# Melt to long format
df.m = melt(df.m, id.vars=c("Subject","IV"), variable.name="Time", value.name="Pupil_Dilation")
head(df.m)
Subject IV Time Pupil_Dilation
1 1 A1:B1 T1 0.2
2 1 A1:B2 T1 0.3
3 1 A1:B3 T1 0.2
4 1 A2:B1 T1 0.3
5 1 A2:B2 T1 0.2
6 1 A2:B3 T1 0.2
Now we can plot a line giving the average value of Pupil_Dilation for each Time point for each level of IV, plus 95% confidence intervals. In your sample data, there's only a single measurement at each Time for each level of IV so no 95% confidence interval is included in the example graph below. However, if you have multiple measurements in your actual data, then you can use the code below to include the confidence interval:
pd = position_dodge(0.5)
# mean_cl_boot needs the Hmisc package to be installed
ggplot(df.m, aes(Time, Pupil_Dilation, colour=IV, group=IV)) +
  stat_summary(fun.data=mean_cl_boot, geom="errorbar", width=0.1, position=pd) +
  # ggplot2 >= 3.3.0 uses `fun`; older versions used `fun.y`
  stat_summary(fun=mean, geom="line", position=pd) +
  stat_summary(fun=mean, geom="point", position=pd) +
  scale_y_continuous(limits=c(0, max(df.m$Pupil_Dilation))) +
  theme_bw()
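For the real data frame shown in the edit, a minimal sketch along the same lines might look like the following. It rests on a few assumptions: Trk_Y.N and NsCond are factors, the X.3 ... X3 columns are time points -3 to 3 (R turns a header such as -3 into X.3 on import), and test_data here is only a hypothetical stand-in built from the rows in the question. The warning in the question comes from assigning a string that is not one of the factor's existing levels; renaming the levels avoids it.

```r
# Hypothetical stand-in for the real data (values copied from the question)
test_data <- data.frame(
  Subject = 1,
  Trk_Y.N = factor(rep(c("N", "Y"), each = 3)),
  NsCond  = factor(rep(c("Pink", "Babble", "Loss"), times = 2)),
  X.3 = 0.3, X.2 = 0.4, X.1 = 0.6, X0 = 0.4, X1 = 0.8, X2 = 0.6, X3 = 0.6
)

# Rename the factor levels (new name = old level) instead of assigning strings
levels(test_data$Trk_Y.N) <- list(No_Tracking = "N", Tracking = "Y")

# Reshape to long format with base R: one row per Subject/condition/time point
time_cols <- grep("^X", names(test_data), value = TRUE)
n <- nrow(test_data)
time_name <- rep(time_cols, each = n)
long <- data.frame(
  test_data[rep(seq_len(n), times = length(time_cols)),
            c("Subject", "Trk_Y.N", "NsCond")],
  # Assumption: "X.3" stands for -3 and "X3" for 3
  Time = as.numeric(sub("^X", "", sub("^X\\.", "-", time_name))),
  Pupil_Dilation = unlist(test_data[time_cols], use.names = FALSE)
)

# Mean across subjects for each condition and time point
means <- aggregate(Pupil_Dilation ~ Trk_Y.N + NsCond + Time, data = long, FUN = mean)

# Optional plot: one line per noise condition, one panel per tracking condition
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(means, aes(Time, Pupil_Dilation, colour = NsCond, group = NsCond)) +
    geom_line() + geom_point() + facet_wrap(~ Trk_Y.N) + theme_bw()
}
```

With about 110 time columns the long table gets large, but that is the shape ggplot2 expects, and the aggregation step keeps the plotted data small.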

Related

Group similar strings with numbers and keep order of first appearance

I have a dataframe which looks like this example (just much larger):
var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)
var v1 v2
1 Peter 0.4 0.5
2 Ben 0.6 0.4
3 Mary 0.7 0.2
4 Peter.1 0.3 0.5
5 Ben.1 0.9 0.4
6 Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8 Ben.2 0.6 0.4
9 Mary.2 0.7 0.2
I want to group the strings in 'var' according to the names without the suffixes, and keep the original order of first appearance. Desired output:
var v1 v2
1 Peter 0.4 0.5 # Peter appears first in the original data
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4 # Ben appears second in the original data
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2 # Mary appears third in the original data
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
How can I achieve that?
Thank you!
An option is to create a temporary column that drops the trailing dot and digits (\\.\\d+$) with str_remove, then arrange by a factor whose levels are the unique values in order of appearance (match would work as well).
library(dplyr)
library(stringr)
df <- df %>%
mutate(var1 = str_remove(var, "\\.\\d+$")) %>%
arrange(factor(var1, levels = unique(var1))) %>%
select(-var1)
Or use fct_inorder from forcats, which converts to a factor with levels in the order of first appearance.
library(forcats)
df %>%
arrange(fct_inorder(str_remove(var, "\\.\\d+$")))
-output
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
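The same idea can be sketched in base R without extra packages, in case that is useful. This is only a sketch on the example data from the question; key and res are names made up for illustration. match() against the unique keys gives each row the rank of its group's first appearance, and order() keeps ties in their original order, so rows stay in sequence within each group.

```r
var <- c("Peter", "Ben", "Mary", "Peter.1", "Ben.1", "Mary.1",
         "Peter.2", "Ben.2", "Mary.2")
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)

# Strip a trailing ".<digits>" to get the grouping key
key <- sub("\\.\\d+$", "", df$var)

# Rank each row by the first appearance of its key, then reorder
res <- df[order(match(key, unique(key))), ]
rownames(res) <- NULL
res
```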
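Compact option with sub and data.table::chgroup (the pattern "\\.\\d+$" also handles suffixes with more than one digit, which "\\.." would not):
df[data.table::chgroup(sub("\\.\\d+$", "", df$var)),]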
var v1 v2
1 Peter 0.4 0.5
4 Peter.1 0.3 0.5
7 Peter.2 0.4 0.1
2 Ben 0.6 0.4
5 Ben.1 0.9 0.4
8 Ben.2 0.6 0.4
3 Mary 0.7 0.2
6 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
chgroup efficiently groups duplicated values together while retaining the group order (according to the first appearance of each group).
If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this:
df %>%
arrange(var)
var v1 v2
1 Ben 0.6 0.4
2 Ben.1 0.9 0.4
3 Ben.2 0.6 0.4
4 Mary 0.7 0.2
5 Mary.1 0.2 0.2
6 Mary.2 0.7 0.2
7 Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1
Separate the var column into two columns, replace the NAs that get generated with 0, sort, and remove the extra columns.
This works on the numeric value of the numbers rather than the character representation, so that, for example, 10 won't come before 2. Also, the match in arrange ensures that the order is based on the order of first occurrence.
library(dplyr)
library(tidyr)
df %>%
separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>%
mutate(no = replace_na(no, 0)) %>%
arrange(match(alpha, alpha), no) %>%
select(-alpha, -no)
giving
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
Update
Have removed what was previously the first solution after reading the update to the question.

Create lower triangle genetic distance matrix

I have a distance matrix like this:
1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6
and now I want to create a lower triangle matrix like this:
1 2 3 4 5 A B C D E
1 0
2 0.1 0
3 0.2 0.1 0
4 0.4 0.3 0.2 0
5 0.5 0.4 0.3 0.1 0
A 0.1 0.2 0.3 0.5 0.6 0
B 0.7 0.8 0.9 1 1.1 0.6 0
C 1.2 1.3 1.4 1.5 1.6 1.1 0.5 0
D 1.7 1.8 1.9 2 2.1 1.6 1 0.5 0
E 2.2 2.3 2.4 2.5 2.6 2.1 1.5 1 0.5 0
To get the genetic distance between 1 and 2, I subtracted the value for 1 from the value for 2 in the first table (0.2 - 0.1 = 0.1), and I did the same for the rest of the entries. I do not know whether this is the correct way to do it. After these calculations I built the lower triangle matrix by hand. I tried this in R:
x <- read.csv("AD2.csv", header = FALSE, sep = ",")
b <- lower.tri(x, diag = FALSE)
but I am getting only TRUE and FALSE as output, not a distance matrix.
Can anyone help solve this problem? Here is a link to my example data.
You can make use of dist to calculate sub-matrices. Then use cbind and create the top and bottom half. Then rbind the 2 halves. Then set upper triangular to NA to create the desired output.
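mat <- rbind(
  # top half: distances among columns 1..5, then t(tbl) so the column names come out as 1..5, A..E
  cbind(as.matrix(dist(tbl[1,])), t(tbl)),
  # bottom half: the original table, then distances among rows A..E
  cbind(tbl, as.matrix(dist(tbl[,1])))
)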
mat[upper.tri(mat, diag=FALSE)] <- NA
mat
Hope it helps.
data:
tbl <- as.matrix(read.table(text="1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6", header=TRUE, check.names=FALSE, row.names=1))
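On the TRUE/FALSE output in the question: lower.tri() and upper.tri() return a logical mask with the same shape as the matrix, not the values themselves, so the mask is meant to be used as an index. A small sketch, using only the first row of the example table (m and mask are illustrative names):

```r
# Pairwise distances among the five values in row A of the example
m <- as.matrix(dist(c(`1` = 0.1, `2` = 0.2, `3` = 0.3, `4` = 0.5, `5` = 0.6)))

# Logical matrix: TRUE on and below the diagonal, FALSE above it
mask <- lower.tri(m, diag = TRUE)

# Use the mask as an index to blank out the upper triangle
m[!mask] <- NA
m
```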

instantaneous velocity - reference previous value

Using a very simple equation, how can I calculate the instantaneous velocity?
Vi = V0 + acceleration * time
This task is very easy in MS Excel, as one can click on the previous cell, but how do we refer to the previous value in R?
acceleration <- c(1,2,3,4,5,4,3,2,1)
time <- rep(0.1,9)
df1 <- data.frame(acceleration, time)
df1$instant.vel <- df1$acceleration * df1$time + ....
Try using dplyr::lag
library(dplyr)
df1 %>%
mutate(V=(lag(acceleration,default=0)*lag(time,default=0))+(acceleration*time))
acceleration time V
1 1 0.1 0.1
2 2 0.1 0.3
3 3 0.1 0.5
4 4 0.1 0.7
5 5 0.1 0.9
6 4 0.1 0.9
7 3 0.1 0.7
8 2 0.1 0.5
9 1 0.1 0.3
Or step by step:
df1 %>%
mutate(V0=(acceleration*time)) %>%
mutate(V1=V0+(lag(acceleration,default=0)*lag(time,default=0)))
acceleration time V0 V1
1 1 0.1 0.1 0.1
2 2 0.1 0.2 0.3
3 3 0.1 0.3 0.5
4 4 0.1 0.4 0.7
5 5 0.1 0.5 0.9
6 4 0.1 0.4 0.9
7 3 0.1 0.3 0.7
8 2 0.1 0.2 0.5
9 1 0.1 0.1 0.3
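Note that lag() only reaches back one row, so the result above adds the current and the previous increment. If V0 is meant to be the velocity already accumulated in the previous row (as when each Excel cell refers to the cell above it), a running total is one way to sketch it. This assumes the starting velocity is 0:

```r
acceleration <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
time <- rep(0.1, 9)
df1 <- data.frame(acceleration, time)

# Each row's velocity = previous velocity + acceleration * time,
# which is the cumulative sum of the per-row increments (start at 0)
df1$instant.vel <- cumsum(df1$acceleration * df1$time)
df1$instant.vel
# 0.1 0.3 0.6 1.0 1.5 1.9 2.2 2.4 2.5
```

For a non-zero starting velocity, adding that constant to the cumsum() result would give the same recursion.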

Summing Values based on Hour and Month and Re-arranging Summed Time Series

I am trying to aggregate (sum) values across months and hours and re-arrange the summed data so that hour and month are on different "axes". I would like the hour to be column headers and the month to be row headers with summed values in each cell. Here's what I mean, through a dummy data example (obviously 12 months are present and 24 hours in the real data):
Month <- c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,7,8,9,10,11,12)
Hour <- c(4,1,3,2,5,5,1,4,3,6,0,0,2,3,1,2,3,4,5,6,2)
Value <- c(0.1,0.4,0.02,0.1,0.1,0.2,0.02,0.01,0.01,0.02,0.1,0.3,0.2,0.1,0.2, 0.1,0.3,0.1,0.01,0.01,0.1)
z <- data.frame(Month, Hour, Value)
head(z)
Month Hour Value
1 4 0.10
1 1 0.40
2 3 0.02
2 2 0.10
3 5 0.10
3 5 0.20
My desired output, Hour = column headers (there will be 24 total, this just shows first 6 hours), Month = row headers (there will be 12 total)
z
0 1 2 3 4 5 6
1 0.3 0.2 0.1 0.7 0.1 1.1 0.7
2 0.1 0.1 0.8 1.7 0.2 0.1 0.6
3 0.2 0.7 0.1 0.4 2.1 1.3 0.1
4 0.1 0.2 0.2 0.1 3.1 0.1 0.7
5 0.7 0.8 1.2 0.2 0.4 0.1 0.2
6 0.5 0.2 3.0 0.8 0.2 5.1 1.2
7 0.5 0.2 3.0 0.8 0.2 5.1 1.2
8 0.5 0.2 3.0 0.8 0.2 5.1 1.2
9 0.5 0.2 3.0 0.8 0.2 5.1 1.2
10 0.5 0.2 3.0 0.8 0.2 5.1 1.2
11 0.5 0.2 3.0 0.8 0.2 5.1 1.2
12 0.5 0.2 3.0 0.8 0.2 5.1 1.2
We can use xtabs to create a contingency table in which the cells hold the sum of Value for each Month/Hour combination:
xtabs(Value ~ Month + Hour, data = z)
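If the real data might be missing some hours or months, one possible sketch is to convert both columns to factors with the full set of levels first, so the table always has 12 rows and 24 columns. The level ranges 1:12 and 0:23 are assumptions based on the description in the question, and tab and wide are illustrative names:

```r
Month <- c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,7,8,9,10,11,12)
Hour  <- c(4,1,3,2,5,5,1,4,3,6,0,0,2,3,1,2,3,4,5,6,2)
Value <- c(0.1,0.4,0.02,0.1,0.1,0.2,0.02,0.01,0.01,0.02,0.1,0.3,0.2,0.1,0.2,
           0.1,0.3,0.1,0.01,0.01,0.1)
z <- data.frame(Month, Hour, Value)

# Fix the full set of levels so empty months/hours still appear (filled with 0)
z$Month <- factor(z$Month, levels = 1:12)
z$Hour  <- factor(z$Hour,  levels = 0:23)

# Rows = Month, columns = Hour, cells = sum of Value
tab <- xtabs(Value ~ Month + Hour, data = z)

# Plain data frame version, if a table object is inconvenient
wide <- as.data.frame.matrix(tab)
```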

Fill nth columns in a dataframe

I have this data frame:
df <- data.frame(A=c("a","b","c","d","e","f","g","h","i"),
B=c("1","1","1","2","2","2","3","3","3"),
C=c(0.1,0.2,0.4,0.1,0.5,0.7,0.1,0.2,0.5))
> df
A B C
1 a 1 0.1
2 b 1 0.2
3 c 1 0.4
4 d 2 0.1
5 e 2 0.5
6 f 2 0.7
7 g 3 0.1
8 h 3 0.2
9 i 3 0.5
I would like to add 1000 further columns and fill these columns with the values generated by:
transform(df, D=ave(C, B, FUN=function(b) sample(b, replace=TRUE)))
I've tried with a for loop but it does not work:
for (i in 4:1000){
df[, 4:1000] <- NA
df[,i] = transform(df, D=ave(C, B, FUN=function(b) sample(b, replace=TRUE)))
}
For efficiency reasons, I suggest running sample only once for each group. This can be achieved with this:
sample2 <- function(x, size)
{
if(length(x)==1) rep(x, size) else sample(x, size, replace=TRUE)
}
new_df <- do.call(rbind, by(df, df$B,
function(d) cbind(d, matrix(sample2(d$C, length(d$C)*1000),
ncol=1000))))
Notes:
I've created sample2 in case there is a group with only one C value. Check ?sample to see what I mean.
The names of the columns will be numbers, from 1 to 1000. This can be changed as in the answer by @agstudy.
The row names are also changed. "Fixing" them is similar, just use row.names instead of col.names.
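To illustrate the point about single-value groups: when sample() is given a single number x that is at least 1, it samples from 1:x rather than returning x itself. A short sketch of the difference (draws and safe are illustrative names):

```r
set.seed(42)
x <- 5

# With a length-one numeric x >= 1, sample() draws from 1:x
draws <- sample(x, 10, replace = TRUE)

# The wrapper from above repeats the single value instead
sample2 <- function(x, size) {
  if (length(x) == 1) rep(x, size) else sample(x, size, replace = TRUE)
}
safe <- sample2(x, 10)
```

Here draws can contain any of 1 to 5, while safe is always ten copies of 5.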
Using replicate for example:
cbind(df,replicate(1000,ave(df$C, df$B,
FUN=function(b) sample(b, replace=TRUE))))
To add 4 columns for example:
cbind(df,replicate(4,ave(df$C, df$B,
FUN=function(b) sample(b, replace=TRUE))))
A B C 1 2 3 4
1 a 1 0.1 0.2 0.2 0.1 0.2
2 b 1 0.2 0.4 0.2 0.4 0.4
3 c 1 0.4 0.1 0.1 0.1 0.1
4 d 2 0.1 0.1 0.5 0.5 0.1
5 e 2 0.5 0.7 0.1 0.5 0.1
6 f 2 0.7 0.1 0.7 0.7 0.7
7 g 3 0.1 0.2 0.5 0.2 0.2
8 h 3 0.2 0.2 0.1 0.2 0.1
9 i 3 0.5 0.5 0.5 0.1 0.5
Maybe you need to rename the columns with something like this (res being the result of the cbind call above):
gsub('([0-9]+)','D\\1',colnames(res))
[1] "A" "B" "C" "D1" "D2" "D3" "D4"
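Alternatively, the new columns could be named before binding, which avoids renaming afterwards. This is just a sketch with 4 columns; new_cols and res are illustrative names, and set.seed is only there to make the draw repeatable:

```r
df <- data.frame(A = c("a","b","c","d","e","f","g","h","i"),
                 B = c("1","1","1","2","2","2","3","3","3"),
                 C = c(0.1,0.2,0.4,0.1,0.5,0.7,0.1,0.2,0.5))

set.seed(1)
# One resampled copy of C per column, resampling within each level of B
new_cols <- replicate(4, ave(df$C, df$B,
                             FUN = function(b) sample(b, replace = TRUE)))

# Name the columns D1, D2, ... before attaching them
colnames(new_cols) <- paste0("D", seq_len(ncol(new_cols)))
res <- cbind(df, new_cols)
names(res)
```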
