Equalize number of trials in two data subsets based on overlapping distributions - r

I ran an experiment in which subjects (n = 14) had to respond on a keyboard to stimuli presented on a screen. They could incur a monetary penalty for incorrect responses in two different conditions, called Penalty 4 and Penalty 14 in what follows. Among other variables, I measured the subjects' decision time (DT) in the task.
All the data are present in a table called 'OutputTable_Online'. Here is what OutputTable_Online looks like (top part):
OutputTable_Online (bottom part):
What I want to do is to average the variables named "ampl_RFDI_sb", "ampl_RAPB_sb", "ampl_RADM_sb" for each 'Subjectnbr' and each 'Penalty' as a function of the 'StimType'. All the information I need is in OutputTable_Online as can be seen in the images above. Here is the code I use for this:
library(reshape)  # for melt() and cast()

Melt_OutputTable_Online <- melt(OutputTable_Online,
    id.var = c('Subjectnbr', 'Penalty', 'Trial_Nbr', 'StimType'),
    measure.var = c('ampl_RFDI_sb', 'ampl_RAPB_sb', 'ampl_RADM_sb',
                    'ampl_LFDI_sb', 'ampl_LAPB_sb', 'ampl_LADM_sb',
                    'ampl_RFDI_ss', 'ampl_RAPB_ss', 'ampl_RADM_ss',
                    'ampl_LFDI_ss', 'ampl_LAPB_ss', 'ampl_LADM_ss',
                    'ampl_RFDI_sm', 'ampl_RAPB_sm', 'ampl_RADM_sm',
                    'ampl_LFDI_sm', 'ampl_LAPB_sm', 'ampl_LADM_sm',
                    'ampl_RFDI_sl', 'ampl_RAPB_sl', 'ampl_RADM_sl',
                    'ampl_LFDI_sl', 'ampl_LAPB_sl', 'ampl_LADM_sl'))
Cast_Melt_OutputTable_Online <- cast(Melt_OutputTable_Online,
    Subjectnbr * Penalty ~ StimType * variable, mean)
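For comparison, here is a sketch (mine, not part of the original pipeline) of the same per-subject, per-penalty, per-stimulus averaging with base R's aggregate(); it shows only the first three amplitude columns and returns the result in long format (one row per combination) rather than the wide cast() layout:
aggregate(cbind(ampl_RFDI_sb, ampl_RAPB_sb, ampl_RADM_sb)
          ~ Subjectnbr + Penalty + StimType,
          data = OutputTable_Online, FUN = mean)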
Here is the output of this process:
However, as expected, the DT distribution is shifted to the right when the penalty was 14, because subjects waited longer to respond (they were more cautious). Hence, the average DT is longer in the Penalty 14 condition than in the Penalty 4 condition.
The group-level density distributions for Penalty 4 (black) and Penalty 14 (green) are represented on the figure here; vertical lines represent the group-level average. Here is the code I used to plot this:
library(plyr)     # for ddply()
library(ggplot2)

OutputTable_Online_DT <- ddply(OutputTable_Online, "Penalty", summarise,
                               grp.mean = mean(DT))

Density_OutputTable_Online <- ggplot(OutputTable_Online, aes(x = DT, fill = Penalty)) +
  geom_density(aes(y = ..count.., group = Penalty), alpha = 0.2) +
  geom_vline(data = OutputTable_Online_DT,
             aes(xintercept = grp.mean, color = Penalty),
             linetype = "dashed", size = 1) +
  ggtitle("Density distributions for both penalty conditions") +
  scale_color_manual(labels = c("P4", "P14"), values = c("black", "green")) +
  scale_fill_manual(labels = c("P4", "P14"), values = c("black", "green")) +
  labs(x = "DT (ms)", y = "Density of trials (a.u.)") +
  coord_cartesian(ylim = c(0, 3.5), xlim = c(0, 3000))

Density_OutputTable_Online
Here is my issue: when I compute the averages for the variables "ampl_RFDI_sb", "ampl_RAPB_sb", "ampl_RADM_sb", etc. as described above, the resulting averages could depend on the DT (since DT differs between the two penalty conditions). I would like to get rid of this confounding factor. To do so, **I would like to homogenize the average DT across the two penalty conditions within each subject. One way to do this would be to select, in each subject, the trials that fall in the overlapping fraction of the two distributions plotted above (i.e., where the green distribution overlaps the black one). Put differently, I would like the two penalty conditions in OutputTable_Online to show the same distribution of DTs when I plot the density distributions, before running the averaging procedure for the variables "ampl_RFDI_sb", "ampl_RAPB_sb", "ampl_RADM_sb", etc.
One way to achieve this would be to equalize the number of trials in the Penalty 4 and Penalty 14 conditions within each DT bin. However, I have no idea how to do this based on the data present in OutputTable_Online as described above.**
Any tip would be very welcome.
Thank you in advance for your help,
Gerard

To obtain the same DT distribution in both penalty conditions, I subsetted the table into bins of DT (using the subset function) and, within each bin, equalized the number of trials across the two conditions by downsampling to the condition with the fewer trials, using the sample function. I did this for each subject of the table using a for loop. Here is the code:
# Initialise an empty data frame to collect the resampled trials.
DF_ampl_sb_tmp <- DF_ampl_sb[0, ]
# Loop over each subject.
for (s in unique(DF_ampl_sb$Subjectnbr))
{
tmp1 <- subset(DF_ampl_sb, subset = Subjectnbr == s)
tmp2 <- subset(tmp1, subset = DT > 1 & DT <= 250)
tmp3 <- subset(tmp1, subset = DT > 250 & DT <= 500)
tmp4 <- subset(tmp1, subset = DT > 500 & DT <= 750)
tmp5 <- subset(tmp1, subset = DT > 750 & DT <= 1000)
tmp6 <- subset(tmp1, subset = DT > 1000 & DT <= 1250)
tmp7 <- subset(tmp1, subset = DT > 1250 & DT <= 1500)
tmp8 <- subset(tmp1, subset = DT > 1500 & DT <= 1750)
tmp9 <- subset(tmp1, subset = DT > 1750 & DT <= 2000)
tmp10 <- subset(tmp1, subset = DT > 2000 & DT <= 2250)
tmp11 <- subset(tmp1, subset = DT > 2250 & DT <= 2500)
tmp12 <- subset(tmp1, subset = DT > 2500 & DT <= 2750)
tmp13 <- subset(tmp1, subset = DT > 2750 & DT <= 3000)
tmp2_Penalty1 <- subset(tmp2, subset = Penalty == 1)
tmp2_Penalty2 <- subset(tmp2, subset = Penalty == 2)
tmp2_Penalty1 <- tmp2_Penalty1[sample(nrow(tmp2_Penalty1), min(nrow(tmp2_Penalty2), nrow(tmp2_Penalty1))), ]
tmp2_Penalty2 <- tmp2_Penalty2[sample(nrow(tmp2_Penalty2), min(nrow(tmp2_Penalty2), nrow(tmp2_Penalty1))), ]
tmp3_Penalty1 <- subset(tmp3, subset = Penalty == 1)
tmp3_Penalty2 <- subset(tmp3, subset = Penalty == 2)
tmp3_Penalty1 <- tmp3_Penalty1[sample(nrow(tmp3_Penalty1), min(nrow(tmp3_Penalty2), nrow(tmp3_Penalty1))), ]
tmp3_Penalty2 <- tmp3_Penalty2[sample(nrow(tmp3_Penalty2), min(nrow(tmp3_Penalty2), nrow(tmp3_Penalty1))), ]
tmp4_Penalty1 <- subset(tmp4, subset = Penalty == 1)
tmp4_Penalty2 <- subset(tmp4, subset = Penalty == 2)
tmp4_Penalty1 <- tmp4_Penalty1[sample(nrow(tmp4_Penalty1), min(nrow(tmp4_Penalty2), nrow(tmp4_Penalty1))), ]
tmp4_Penalty2 <- tmp4_Penalty2[sample(nrow(tmp4_Penalty2), min(nrow(tmp4_Penalty2), nrow(tmp4_Penalty1))), ]
tmp5_Penalty1 <- subset(tmp5, subset = Penalty == 1)
tmp5_Penalty2 <- subset(tmp5, subset = Penalty == 2)
tmp5_Penalty1 <- tmp5_Penalty1[sample(nrow(tmp5_Penalty1), min(nrow(tmp5_Penalty2), nrow(tmp5_Penalty1))), ]
tmp5_Penalty2 <- tmp5_Penalty2[sample(nrow(tmp5_Penalty2), min(nrow(tmp5_Penalty2), nrow(tmp5_Penalty1))), ]
tmp6_Penalty1 <- subset(tmp6, subset = Penalty == 1)
tmp6_Penalty2 <- subset(tmp6, subset = Penalty == 2)
tmp6_Penalty1 <- tmp6_Penalty1[sample(nrow(tmp6_Penalty1), min(nrow(tmp6_Penalty2), nrow(tmp6_Penalty1))), ]
tmp6_Penalty2 <- tmp6_Penalty2[sample(nrow(tmp6_Penalty2), min(nrow(tmp6_Penalty2), nrow(tmp6_Penalty1))), ]
tmp7_Penalty1 <- subset(tmp7, subset = Penalty == 1)
tmp7_Penalty2 <- subset(tmp7, subset = Penalty == 2)
tmp7_Penalty1 <- tmp7_Penalty1[sample(nrow(tmp7_Penalty1), min(nrow(tmp7_Penalty2), nrow(tmp7_Penalty1))), ]
tmp7_Penalty2 <- tmp7_Penalty2[sample(nrow(tmp7_Penalty2), min(nrow(tmp7_Penalty2), nrow(tmp7_Penalty1))), ]
tmp8_Penalty1 <- subset(tmp8, subset = Penalty == 1)
tmp8_Penalty2 <- subset(tmp8, subset = Penalty == 2)
tmp8_Penalty1 <- tmp8_Penalty1[sample(nrow(tmp8_Penalty1), min(nrow(tmp8_Penalty2), nrow(tmp8_Penalty1))), ]
tmp8_Penalty2 <- tmp8_Penalty2[sample(nrow(tmp8_Penalty2), min(nrow(tmp8_Penalty2), nrow(tmp8_Penalty1))), ]
tmp9_Penalty1 <- subset(tmp9, subset = Penalty == 1)
tmp9_Penalty2 <- subset(tmp9, subset = Penalty == 2)
tmp9_Penalty1 <- tmp9_Penalty1[sample(nrow(tmp9_Penalty1), min(nrow(tmp9_Penalty2), nrow(tmp9_Penalty1))), ]
tmp9_Penalty2 <- tmp9_Penalty2[sample(nrow(tmp9_Penalty2), min(nrow(tmp9_Penalty2), nrow(tmp9_Penalty1))), ]
tmp10_Penalty1 <- subset(tmp10, subset = Penalty == 1)
tmp10_Penalty2 <- subset(tmp10, subset = Penalty == 2)
tmp10_Penalty1 <- tmp10_Penalty1[sample(nrow(tmp10_Penalty1), min(nrow(tmp10_Penalty2), nrow(tmp10_Penalty1))), ]
tmp10_Penalty2 <- tmp10_Penalty2[sample(nrow(tmp10_Penalty2), min(nrow(tmp10_Penalty2), nrow(tmp10_Penalty1))), ]
tmp11_Penalty1 <- subset(tmp11, subset = Penalty == 1)
tmp11_Penalty2 <- subset(tmp11, subset = Penalty == 2)
tmp11_Penalty1 <- tmp11_Penalty1[sample(nrow(tmp11_Penalty1), min(nrow(tmp11_Penalty2), nrow(tmp11_Penalty1))), ]
tmp11_Penalty2 <- tmp11_Penalty2[sample(nrow(tmp11_Penalty2), min(nrow(tmp11_Penalty2), nrow(tmp11_Penalty1))), ]
tmp12_Penalty1 <- subset(tmp12, subset = Penalty == 1)
tmp12_Penalty2 <- subset(tmp12, subset = Penalty == 2)
tmp12_Penalty1 <- tmp12_Penalty1[sample(nrow(tmp12_Penalty1), min(nrow(tmp12_Penalty2), nrow(tmp12_Penalty1))), ]
tmp12_Penalty2 <- tmp12_Penalty2[sample(nrow(tmp12_Penalty2), min(nrow(tmp12_Penalty2), nrow(tmp12_Penalty1))), ]
tmp13_Penalty1 <- subset(tmp13, subset = Penalty == 1)
tmp13_Penalty2 <- subset(tmp13, subset = Penalty == 2)
tmp13_Penalty1 <- tmp13_Penalty1[sample(nrow(tmp13_Penalty1), min(nrow(tmp13_Penalty2), nrow(tmp13_Penalty1))), ]
tmp13_Penalty2 <- tmp13_Penalty2[sample(nrow(tmp13_Penalty2), min(nrow(tmp13_Penalty2), nrow(tmp13_Penalty1))), ]
# Add the resampled trials to the output data frame (DF_ampl_sb_tmp) by row-binding.
DF_ampl_sb_tmp <- rbind (DF_ampl_sb_tmp,tmp2_Penalty1, tmp2_Penalty2, tmp3_Penalty1,
tmp3_Penalty2, tmp4_Penalty1, tmp4_Penalty2, tmp5_Penalty1, tmp5_Penalty2,
tmp6_Penalty1, tmp6_Penalty2, tmp7_Penalty1, tmp7_Penalty2, tmp8_Penalty1,
tmp8_Penalty2, tmp9_Penalty1, tmp9_Penalty2, tmp10_Penalty1, tmp10_Penalty2,
tmp11_Penalty1, tmp11_Penalty2, tmp12_Penalty1, tmp12_Penalty2,tmp13_Penalty1,
tmp13_Penalty2)
# Remove objects from a specified environment.
rm(tmp1, tmp2_Penalty1, tmp2_Penalty2, tmp3_Penalty1, tmp3_Penalty2, tmp4_Penalty1,
tmp4_Penalty2, tmp5_Penalty1, tmp5_Penalty2, tmp6_Penalty1, tmp6_Penalty2,
tmp7_Penalty1, tmp7_Penalty2, tmp8_Penalty1, tmp8_Penalty2, tmp9_Penalty1,
tmp9_Penalty2, tmp10_Penalty1, tmp10_Penalty2, tmp11_Penalty1, tmp11_Penalty2,
tmp12_Penalty1, tmp12_Penalty2, tmp13_Penalty1, tmp13_Penalty2)
}
dim(DF_ampl_sb_tmp)
DF_ampl_sb <- DF_ampl_sb_tmp
There might be another way of subsetting the table; here I defined the bins manually in the loop (i.e., from tmp2 to tmp13). However, it already works pretty well (a more compact sketch of the same idea is given after the before/after plots below). Here is the kind of distribution I obtain before using the code:
And after using it:
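For reference, here is a more compact base-R sketch of the same binning-and-downsampling idea (my own rewrite, not the code used above), assuming DF_ampl_sb has the columns Subjectnbr, Penalty (coded 1/2) and DT in ms, and using the same manual bin edges as in the loop:
breaks <- c(1, seq(250, 3000, by = 250))   # same DT bin edges as above
equalize_bin <- function(df) {
  # keep, in each penalty condition, only as many trials as the rarer condition has
  p1 <- df[df$Penalty == 1, ]
  p2 <- df[df$Penalty == 2, ]
  n_keep <- min(nrow(p1), nrow(p2))
  rbind(p1[sample(nrow(p1), n_keep), ], p2[sample(nrow(p2), n_keep), ])
}
DF_ampl_sb$DT_bin <- cut(DF_ampl_sb$DT, breaks = breaks)
pieces <- split(DF_ampl_sb, list(DF_ampl_sb$Subjectnbr, DF_ampl_sb$DT_bin), drop = TRUE)
DF_ampl_sb_matched <- do.call(rbind, lapply(pieces, equalize_bin))
Trials with DT outside (1, 3000] get an NA bin and are dropped by split(), which matches the manual bins above.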
Gerard

Related

calculate new random number considering distribution of already existing numbers in r

I have a dataframe with participants and I want to randomly assign them to a group (0 or 1). Each group should have approximately the same number of participants.
My problem: I will keep adding participants, so when I compute a random group for a new participant, it should take into account the distribution of the group assignments I already have.
This is my code:
groupData <- data.frame(participant = c(1), Group = floor(runif(1, min=0, max=2)))
groupData[nrow(groupData) + 1,] = c(2,floor(runif(1, min=0, max=2))) # with this I will be adding participants
I think what you're saying is that when iteratively adding participants to groupData, you want to randomly assign them to a group such that over time, the groups will be evenly distributed.
N.B., iteratively adding rows to a frame scales horribly, so if you're doing this with a lot of data, it will slow down a lot. See "Growing Objects" in The R Inferno.
We can weight the two groups inversely to their current sizes, so that a new participant has a slightly higher likelihood of being assigned to the under-populated group.
For instance, if we already have 100 participants with unbalanced groups:
set.seed(42)
groupData <- data.frame(participant = 1:100, Group = sample(c(rep(0, 70), rep(1, 30))))
head(groupData)
# participant Group
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 0
# 6 6 1
table(groupData$Group)
# 0 1
# 70 30
then we can prioritize the under-filled group using
100 / (table(c(0:1, groupData$Group))-1)
# 0 1
# 1.428571 3.333333
which can be used with sample as in
sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1) )
I use table(c(0:1, ..)) - 1 because I want this to work even when one of the groups has no participants yet; by concatenating 0:1, I ensure each group is counted at least once, and the "minus one" compensates for this artificial inflation, keeping the ratios unbiased.
To "prove" that this eventually rounds out ...
for (pa in 101:400) {
newgroup <- sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group))-1))
groupData <- rbind(groupData, data.frame(participant=pa, Group=newgroup))
}
library(ggplot2)
transform(groupData, GroupDiff = cumsum(Group == 0) - cumsum(Group == 1)) |>
ggplot(aes(participant, y = GroupDiff)) +
geom_point() +
geom_hline(yintercept=0) +
geom_vline(xintercept = 100) +
geom_text(data=data.frame(participant=101, GroupDiff=c(-Inf, -1, 1), vjust=c(-0.5, 0.5, -0.5), label=c("Start of group-balancing", "Group0-heavy", "Group1-heavy")), hjust=0, aes(label=label, vjust=vjust))
It is possible (even likely) that the balance will sway from side-to-side, but in general (asymptotically) it should stay balanced.
It occurs to me that the simplest method is just to assign people in pairs. Draw a random number (0 or 1), assign person N to the group associated with that value, and assign person N+1 to the other group. That guarantees random assignment as well as perfectly equal group sizes.
Whether this properly simulates the situation you want to analyze is a separate issue.
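A minimal sketch of that pairwise idea (my own illustration; the variable names are made up):
n <- 11                                            # participants to assign (odd on purpose)
first_of_pair <- sample(0:1, ceiling(n / 2), replace = TRUE)
# interleave each draw with its complement, then truncate to n participants
Group <- as.vector(rbind(first_of_pair, 1 - first_of_pair))[1:n]
groupData <- data.frame(participant = 1:n, Group = Group)
table(groupData$Group)                             # group sizes differ by at most 1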

How to generate a random sample according with a vector of different probabilities in a single command in R?

I need to simulate a vote cast ([0] = Reject, [1] = Approve) by a fictitious population according to their "probability to approve" (i.e., their probability of casting a vote of [1]). Each individual (id) has a home state (uf), which supposedly has a different probability to approve (prob_approve) that is considered known in this toy example.
Here is an example of the data:
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
I tried:
x <- list(prob = pop_tab$prob_approve)
vote <- lapply(x, runif)
... but I don't think runif() actually used the probabilities in the "prob_approve" column.
How could I simulate the vote cast by the population, according to their home-state probabilities, in a single command, without having to process line by line in a for loop?
Thank you in advance.
Use rbinom():
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
rbinom(n = nrow(pop_tab),
size = 1,
prob = pop_tab$prob_approve)
## [1] 0 0 1 0 0 1 1 1 1 1
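Since the question tried runif(), note that an equivalent draw can also be written as a comparison against uniform random numbers; this sketch gives Bernoulli draws with the same probabilities as the rbinom() call above (results differ only through the random seed):
as.integer(runif(nrow(pop_tab)) < pop_tab$prob_approve)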

R "Error in draw.quad.venn, Impossible: produces negative area" despite numbers being correct

I'm trying to generate a four way Venn diagram using draw.quad.venn in the VennDiagram package in R, but it keeps throwing up the error message:
ERROR [2019-05-14 11:28:24] Impossible: a7 <- n234 - a6 produces negative area
Error in draw.quad.venn(length(gene_lists[[1]]), length(gene_lists[[2]]), :
Impossible: a7 <- n234 - a6 produces negative area
I'm using 4 different lists of genes as the input. calculate.overlap works fine, then I get the numbers by using the length(x) function over the overlap values, parsed as a list. I pass all of the overlap values, along with the appropriate total group sizes, to the draw.quad.venn function, but it keeps claiming that one of the groups is impossible because it generates a negative number.
I've checked the numbers manually and they clearly add up to the correct values. I've also tested the script on a random set of 20000 genes, generated using something similar to the script below, and it works fine, i.e. it generates a four-way Venn diagram. There are no differences between the randomly generated gene lists and the ones I've curated from actual results files, apart from their sizes. A minimal working example can be seen below:
# working example that fails
library(VennDiagram)
# get a vector of 10000 elements (representative of a gene list)
values <- 1:10000
# generate 4 subsets by random sampling
list_1 <- sample(values, size = 5000, replace = FALSE)
list_2 <- sample(values, size = 4000, replace = FALSE)
list_3 <- sample(values, size = 3000, replace = FALSE)
list_4 <- sample(values, size = 2000, replace = FALSE)
# compile them in to a list
lists <- list(list_1, list_2, list_3, list_4)
# find overlap between all possible combinations (11 plus 4 unique to each list = 15 total)
overlap <- calculate.overlap(lists)
# get the lengths of each list - these will be the numbers used for the Venn diagram
overlap_values <- lapply(overlap, function(x) length(x))
# rename overlap values (easier to identify which groups are intersecting)
names(overlap_values) <- c("n1234", "n123", "n124", "n134", "n234", "n12", "n13", "n14", "n23", "n24", "n34", "n1", "n2", "n3", "n4")
# generate the venn diagram
draw.quad.venn(length(lists[[1]]), length(lists[[2]]), length(lists[[3]]), length(lists[[4]]), overlap_values$n12,
overlap_values$n13, overlap_values$n14, overlap_values$n23, overlap_values$n24, overlap_values$n34,
overlap_values$n123, overlap_values$n124, overlap_values$n134, overlap_values$n234, overlap_values$n1234)
I expect a four way Venn diagram regardless of whether or not some groups are 0, they should still be there, but labelled as 0. This is what it should look like:
I'm not sure if it's because I have 0 values in the real data i.e. certain groups where there is no overlap? Is there any way to force draw.quad.venn() to take any values? If not, is there another package that I can use to achieve the same results? Any help greatly appreciated!
So nothing I tried could solve the error with the draw.quad.venn in the VennDiagram package. There's something wrong with the way it's written. As long as all of the numbers in each of the 4 ellipses add up to the total number of elements in that particular list, the Venn diagram is valid. For some reason, VennDiagram will only accept data where fewer intersections lead to higher numbers e.g. the intersection of groups 1, 2 and 3 MUST be higher than the intersection of all 4 groups. This doesn't represent real world data. It's entirely possible for groups 1, 2 and 3 to not intersect at all, whilst all 4 groups do intersect. In a Venn diagram, all of the numbers are independent, and represent the total number of elements common at each intersection. They do not have to have any bearing on each other.
I had a look at the eulerr package, but actually found a very simple method of plotting the venn diagram using venn in gplots, as follows:
# simple 4-way Venn diagram using gplots
library(gplots)
# get some mock data
values <- 1:20000
list_1 <- sample(values, size = 5000, replace = FALSE)
list_2 <- sample(values, size = 4000, replace = FALSE)
list_3 <- sample(values, size = 3000, replace = FALSE)
list_4 <- sample(values, size = 2000, replace = FALSE)
lists <- list(list_1, list_2, list_3, list_4)
# name the list (required for gplots)
names(lists) <- c("G1", "G2", "G3", "G4")
# get the venn table
v.table <- venn(lists)
# show venn table
print(v.table)
# plot Venn diagram
plot(v.table)
I now consider the matter solved. Thank you zx8754 for your help!
I have had a look at the source code of the package. In case you are still interested in the reason for the error, there are two ways to send data to venn.diagram. One is the nxxxx (e. g., n134) form and the other is the an (e. g., a5) form. In the examples, n134 means "which elements belong at least to groups 1, 3 and 4". On the other hand, a5 means "which elements only belong to groups 1, 3 and 4". The relationship between both forms is really convoluted, for instance a6 corresponds to n1234. This means that n134 = a5 + a6.
The problem is that calculate.overlap gives the numbers in the an form, whereas by default draw.quad.venn expects numbers in the nxxxx form. To use the values from calculate.overlap, you can set direct.area to true and provide the result of calculate.overlap in the area.vector parameter. For instance,
tmp <- calculate.overlap(list(a=c(1, 2, 3, 4, 10), b=c(3, 4, 5, 6), c=c(4, 6, 7, 8, 9), d=c(4, 8, 1, 9)))
overlap_values <- lapply(tmp, function(x) length(x))
draw.quad.venn(area.vector = c(overlap_values$a1, overlap_values$a2, overlap_values$a3, overlap_values$a4,
overlap_values$a5, overlap_values$a6, overlap_values$a7, overlap_values$a8,
overlap_values$a9, overlap_values$a10, overlap_values$a11, overlap_values$a12,
overlap_values$a13, overlap_values$a14, overlap_values$a15), direct.area = T, category = c('a', 'b', 'c', 'd'))
If you are interested in something simpler and more flexible, I made the nVennR package for this type of problems:
library(nVennR)
g1 <- c('AF029684', 'M28825', 'M32074', 'NM_000139', 'NM_000173', 'NM_000208', 'NM_000316', 'NM_000318', 'NM_000450', 'NM_000539', 'NM_000587', 'NM_000593', 'NM_000638', 'NM_000655', 'NM_000789', 'NM_000873', 'NM_000955', 'NM_000956', 'NM_000958', 'NM_000959', 'NM_001060', 'NM_001078', 'NM_001495', 'NM_001627', 'NM_001710', 'NM_001716')
g2 <- c('NM_001728', 'NM_001835', 'NM_001877', 'NM_001954', 'NM_001992', 'NM_002001', 'NM_002160', 'NM_002162', 'NM_002258', 'NM_002262', 'NM_002303', 'NM_002332', 'NM_002346', 'NM_002347', 'NM_002349', 'NM_002432', 'NM_002644', 'NM_002659', 'NM_002997', 'NM_003032', 'NM_003246', 'NM_003247', 'NM_003248', 'NM_003259', 'NM_003332', 'NM_003383', 'NM_003734', 'NM_003830', 'NM_003890', 'NM_004106', 'AF029684', 'M28825', 'M32074', 'NM_000139', 'NM_000173', 'NM_000208', 'NM_000316', 'NM_000318', 'NM_000450', 'NM_000539')
g3 <- c('NM_000655', 'NM_000789', 'NM_004107', 'NM_004119', 'NM_004332', 'NM_004334', 'NM_004335', 'NM_004441', 'NM_004444', 'NM_004488', 'NM_004828', 'NM_005214', 'NM_005242', 'NM_005475', 'NM_005561', 'NM_005565', 'AF029684', 'M28825', 'M32074', 'NM_005567', 'NM_003734', 'NM_003830', 'NM_003890', 'NM_004106', 'AF029684', 'NM_005582', 'NM_005711', 'NM_005816', 'NM_005849', 'NM_005959', 'NM_006138', 'NM_006288', 'NM_006378', 'NM_006500', 'NM_006770', 'NM_012070', 'NM_012329', 'NM_013269', 'NM_016155', 'NM_018965', 'NM_021950', 'S69200', 'U01351', 'U08839', 'U59302')
g4 <- c('NM_001728', 'NM_001835', 'NM_001877', 'NM_001954', 'NM_005214', 'NM_005242', 'NM_005475', 'NM_005561', 'NM_005565', 'ex1', 'ex2', 'NM_003890', 'NM_004106', 'AF029684', 'M28825', 'M32074', 'NM_000139', 'NM_000173', 'NM_000208', 'NM_000316', 'NM_000318', 'NM_000450', 'NM_000539')
myV <- plotVenn(list(g1=g1, g2=g2, g3=g3, g4=g4))
myV <- plotVenn(nVennObj = myV)
myV <- plotVenn(nVennObj = myV)
The last command is repeated on purpose. The result:
You can then explore the intersections:
> getVennRegion(myV, c('g1', 'g2', 'g4'))
[1] "NM_000139" "NM_000173" "NM_000208" "NM_000316" "NM_000318" "NM_000450" "NM_000539"
There is a vignette with more information.

Trying to program trading signals in R

I am new to R and am trying to program a pairs trading strategy in R.
I have already written the code for downloading the data, and I have created additional columns and prepared the data. Now I need to calculate the trading signals.
My signal rules are as follows.
- If Z-Score is greater than 2.25 , Sell the pair; Buy back when Z-Score is less than 0.25.
- If Z-Score is less than -2.25 , Buy the pair; sell (Exit) when z-score is above -0.25.
- close any open position if there is a change in signal.
When we sell a pair, we sell the first stock and buy the second stock. In this case, we sell ACC and Buy Ambujacem.
When we buy a pair, we buy the first stock and sell the second stock. In this case, we buy ACC and Sell Ambujacem.
Could anyone help me with the coding for the trading signals?
Enclosing the code.
Regards,
Subash
# Trading Code
library(quantmod)
getSymbols("ACC.NS", from=as.Date('2007-01-01'), to=as.Date('2015-07-24'))
getSymbols("AMBUJACEM.NS", from=as.Date('2007-01-01'), to=as.Date('2015-07-24'))
acc=ACC.NS[,6]
amb=AMBUJACEM.NS[,6]
t.zoo <- merge(acc, amb, all=TRUE)
t.zoo=as.data.frame(t.zoo)
typeof(t.zoo)
t.zoo=na.omit(t.zoo)
#adding columns
t.zoo$spread <- 0
t.zoo$adfTest <- 0
t.zoo$mean <- 0
t.zoo$stdev <- 0
t.zoo$zScore <- 0
t.zoo$signal <- 0
t.zoo$BuyPrice <- 0
t.zoo$SellPrice <- 0
t.zoo$LongReturn <- 0
t.zoo$ShortReturn <- 0
t.zoo$Slippage <- 0
t.zoo$TotalReturn <- 0
#preparing the data
#Calculating the pair ratio
t.zoo$pairRatio <- t.zoo$ACC.NS.Adjusted/t.zoo$AMBUJACEM.NS.Adjusted
#Calculate the log prices of the two time series
t.zoo$LogA <- log10(t.zoo$ACC.NS.Adjusted)
t.zoo$LogB <- log10(t.zoo$AMBUJACEM.NS.Adjusted)
#Calculating the spread
t.zoo$spread <- t.zoo$ACC.NS.Adjusted/t.zoo$AMBUJACEM.NS.Adjusted
#Calculating the mean
# Computes the mean using the SMA function
# choose the number of days for calculating the mean
SMAdays = 20
t.zoo$mean <- SMA(t.zoo$spread,SMAdays)
#Calculating the Std Deviation
t.zoo$stdev <- rollapply(t.zoo$spread,20,sd, fill=NA, align='right')
#Calculating the Z Score
t.zoo$zScore <- (t.zoo$pairRatio - t.zoo$mean)/t.zoo$spread
View(t.zoo)
#Calculation of trading signals and trading prices
#Trigger sell or buy signal if Z Score moves above 2.25 or below -2.25.
# Close position if Z Score reaches 0.2 or -0.2.
# close any open position if there is a change in signal.
I think the main issue was to come up with trading signals for a strategy that depends not only on the current level of the indicator but also on the direction from which the indicator is crossed.
There were a number of problems with the code posted in comments, including the use of a single = for comparisons, so I've reworked it afresh.
Here's my attempt at solving this. It seems to be fine. I've added some plotting code to eyeball the results. I suggest you check the result over different periods.
This code comes after the one in the original question. The only difference is that I have kept t.zoo as an xts/zoo object and not converted it to a data.frame. Also, I've multiplied the zScores by 100.
It generates trigger dates and also a column depicting the state of the strategy. Calculating returns would be easy from there (a rough sketch is given after the code).
colnames(t.zoo)
#t.zoo must be an xts object
#working on a separate xts object
sigs<- t.zoo[, c("ACC.NS.Adjusted", "AMBUJACEM.NS.Adjusted" , "zScore")]
# creating my own triggers as there are not enough good values
# buyTrig<- mean(t.zoo$zScore ,na.rm = T) - 1*sd(t.zoo$zScore ,na.rm = T)
# sellTrig<- (-1) * buyTrig
# sqOffTrig<- mean(t.zoo$zScore ,na.rm = T) - 0.5*sd(t.zoo$zScore ,na.rm = T)
# Another approach: scaling tz.zoo to fit your criterion
sigs$zScore<- sigs$zScore*100
buyTrig<- (-2.25)
sellTrig<- (-1) * buyTrig
sqOffTrig<- 0.25
cat ( buyTrig, sellTrig , sqOffTrig)
hist(sigs$zScore, breaks = 40)
abline(v=c(buyTrig,sellTrig), col="red")
abline(v=c(-sqOffTrig, sqOffTrig), col="green")
sum(sigs$zScore >= -sqOffTrig & sigs$zScore<= sqOffTrig , na.rm = T) # 139
sigs$action<- 0
sigs$mode <- NA
sigs$zLag<- lag.xts(sigs$zScore,1)
sigs[19:22,]
#these are not the real trigger dates, but they will serve our purpose
# along with na.locf
buyTrigDays<- time(sigs[sigs$zScore<= buyTrig & sigs$zLag > buyTrig, ])
sellTrigDays<- time(sigs[sigs$zScore>= sellTrig & sigs$zLag < sellTrig, ])
#square offs
buySqOffDays<- time( sigs[sigs$zScore>= (-1*sqOffTrig) & sigs$zLag < (-1*sqOffTrig), ] )
buySqOffDays
sellSqOffDays<- time( sigs[sigs$zScore<= (sqOffTrig) & sigs$zLag > (sqOffTrig), ] )
sellSqOffDays
sigs$mode[buyTrigDays]=1 ; sigs$mode[sellTrigDays]= -1;
sigs$mode[buySqOffDays]=0 ; sigs$mode[sellSqOffDays]= 0;
sigs$mode
# use local fill to repeat these triggered position into future
# till you meet another non NA value
sigs$mode<- na.locf(sigs$mode, fromLast = F)
plot((sigs$zScore["2015"] ))
points(sigs$zScore[sigs$mode==1], col="red", on=1, pch = 19)
points(sigs$zScore[sigs$mode==-1], col="green", on=1 , pch = 19)
points(sigs$zScore[sigs$mode==0], col="blue", on=1)
sum(is.na(sigs$mode))
#now to get the real dates when square off is triggered
trigdays<- time( sigs[diff(sigs$mode,1) != 0, ] ) #when the value changes
squareOffTrigger_real<- time(sigs[sigs$mode==0][trigdays])
buyTrigger_real<- time(sigs[sigs$mode==1] [trigdays])
sellTrigger_real<- time(sigs[sigs$mode==-1][trigdays])
#check
length(sellTrigger_real) + length(buyTrigger_real) == length(squareOffTrigger_real)
plot(sigs$zScore["2015"])
points(sigs$zScore[buyTrigger_real] , col="blue", pch = 19, on=1)
points(sigs$zScore[sellTrigger_real] , col="red", pch = 19, on=1)
points(sigs$zScore[squareOffTrigger_real] , col="green", pch = 19, on=1)
abline(h=c(-sqOffTrig, sqOffTrig) , col= "green" )
# further calculations can be easily made using either the mode
# column or the trigger dates computed at the end
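As an illustration of those further calculations, here is a hedged sketch (my own, not part of the answer above) that turns the mode column into strategy returns; it assumes that mode = 1 means long ACC / short AMBUJACEM, as defined in the question, and it ignores costs and slippage:
# daily log-returns of the spread (long ACC, short AMBUJACEM)
sprdRet <- diff(log(sigs$ACC.NS.Adjusted)) - diff(log(sigs$AMBUJACEM.NS.Adjusted))
# the position held over a bar is the mode decided on the previous bar
stratRet <- na.omit(lag.xts(sigs$mode, 1) * sprdRet)
equity <- xts(cumsum(as.numeric(stratRet)), order.by = index(stratRet))
plot(equity)   # cumulative log-return of the strategy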

Distance matrix

I am trying to determine the distance between every point in one data set and every point in another data set in R. Each data set has an X and a Y parameter. I have been converting the data sets into data frames and then finding the distances. However, my current code creates a large matrix that lists both data sets as rows and columns, and I then have to pick out the specific part of the matrix I care about to get my answers. Is there a way to put just DSA as the columns and DSB as the rows? This would cut the matrix to a quarter of its size and, since my data sets contain thousands of points each, would really cut down the algorithm's run time.
Here is the code I am using
tumor <- data.frame(DSA[, c("X_Parameter", "Y_Parameter")])
cells <- data.frame(DSB[, c("X_Parameter", "Y_Parameter")])
distances <- as.matrix(dist(rbind(tumor, cells)))
row.start <- nrow(tumor) + 1
row.end <- nrow(tumor) + nrow(cells)
col.start <- 1
col.end <- nrow(tumor)
distances[row.start:row.end, col.start:col.end]
d <- distances[row.start:row.end, col.start:col.end]
Try flexclust::dist2:
n_tumor = 2000
n_cells = 2000
tumor = matrix(runif(n_tumor * 2), n_tumor, 2)
cells = matrix(runif(n_cells * 2), n_cells, 2)
t_dist = system.time({
distances<-as.matrix(dist(rbind(tumor,cells)))
row.start<-nrow(tumor)+1
row.end<-nrow(tumor)+nrow(cells)
col.start<-1
col.end<-nrow(tumor)
d <- distances[row.start:row.end, col.start:col.end]
})[3]
require(flexclust)
t_dist2 = system.time({d2 = dist2(x = cells, y = tumor, method = "euclidean")})[3]
t_dist # 1.477
t_dist2 # 0.244
identical(unname(d), d2) # TRUE
EDIT:
Another alternative is proxy::dist.
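A minimal sketch of that proxy-based alternative (my own, not benchmarked here):
library(proxy)
# cross-distances only: rows = cells, columns = tumor
d3 <- as.matrix(proxy::dist(cells, tumor, method = "Euclidean"))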
A base-R approach that computes only the portion of the matrix you need:
tumoridx <- rep(1:nrow(tumor), each = nrow(cells))
cellsidx <- rep(1:nrow(cells), nrow(tumor))
tcdist <- matrix(sqrt(rowSums((tumor[tumoridx, ] - cells[cellsidx, ])^2)),
                 nrow(cells), nrow(tumor))
