Chart with built-in groupby and secondary Y %s in R

Thanks for this wonderful community and expert responses. This is my first question on stackoverflow. I did research but couldn't find what I am trying to do.
How can I write efficient R code that creates a chart with a secondary Y axis and also does the groupby for total counts based on a given variable? I want the groupby to happen within the code rather than having to create a separate data frame for every variable that I want to plot on X.
I have thousands of rows and hundreds of columns in an R data frame. My sample data looks like this (20 x 5):
tv = c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0)
pr1 =c("AA", "AB", "ZH", "AA", "ZA", "AB", "ZA", "ZA", "AA", "AA", "ZA", "AA", "ZG", "AA", "ZF", "AB", "AA", "AB", "AA", "AA")
pr2 =c("B", "F", "F", "J", "E", "E", "J", "B", "J", "F", "B", "B", "J", "B", "F", "J", "B", "F", "B", "E")
pr3 =c(13, 13, 25, 13, 13, 13, 13, 1, 13, 13, 13, 13, 25, 13, 25, 1, 13, 13, 13, 13)
sample_data = data.frame("SN"= c(1:20),"Target_Vbl"=tv,Predictor_1=pr1,Predictor_2=pr2,Predictor_3=pr3)
From this sample data, I can create the chart I am looking for in Excel, but I am lost when it comes to plotting it in R. I want to re-use the code for any other predictor variable, but my Y axes will always remain the same, i.e. the primary Y is the total count of Target_Vbl and the secondary Y is the % of ones for each category of the predictor variable plotted on the X axis.
The chart should look like the one below, currently plotted for Predictor_1 (drawn in Excel).
Edit - After trying the plotrix
Continuing with sample_data, I created a summary data frame to use with the plotrix package (thanks, lawyeR). twoord.plot takes me closer to what I am looking for, but there are a few discrepancies:
1. I am not getting bars for tc (the total count per Predictor_1 level) on the left Y axis. I tried passing "bar" in the type option, but it did not work.
2. The X axis labels don't show the values from the data and default to numbers instead. They should show "AA", "AB", "ZA", etc., not 1, 2, 3...
3. Is there a way to make the overall process more concise? I feel my code is crude at best. Any pointers would be helpful.
library(sqldf)
smry = sqldf("Select Predictor_1, count(Target_Vbl) as tc, sum(Target_Vbl)
as conv from sample_data Group by Predictor_1")
smry$ratio = round((smry$conv/smry$tc),2)
library(plotrix)
twoord.plot(smry$Predictor_1, smry$tc,
smry$Predictor_1, smry$ratio,
type= c("l", "l"), lcol=3,rcol=4,do.first="plot_bg(\"gray\")")
The graph now looks like this:
[Image: output of twoord.plot]
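One way to keep the grouping inside the plotting code is sketched below in base R, without sqldf or plotrix. The helper names summarise_target and plot_target are hypothetical, and the colours and axis labels are assumptions. The idea is that barplot() returns the bar midpoints, which a second plot() overlaid with par(new = TRUE) can reuse to draw the ratio line against a right-hand axis.

```r
# Sample data from the question (only the columns used here).
tv  <- c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0)
pr1 <- c("AA", "AB", "ZH", "AA", "ZA", "AB", "ZA", "ZA", "AA", "AA",
         "ZA", "AA", "ZG", "AA", "ZF", "AB", "AA", "AB", "AA", "AA")
sample_data <- data.frame(SN = 1:20, Target_Vbl = tv, Predictor_1 = pr1)

# Hypothetical helper: total count and share of ones per level of `predictor`.
summarise_target <- function(data, predictor, target = "Target_Vbl") {
  tc   <- tapply(data[[target]], data[[predictor]], length)
  conv <- tapply(data[[target]], data[[predictor]], sum)
  data.frame(level = names(tc),
             tc    = as.vector(tc),
             conv  = as.vector(conv),
             ratio = round(as.vector(conv / tc), 2),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: bars for counts (left axis), line for the ratio (right axis).
plot_target <- function(data, predictor, target = "Target_Vbl") {
  smry <- summarise_target(data, predictor, target)
  old  <- par(mar = c(5, 4, 4, 4) + 0.1)  # widen the right margin for the second axis
  on.exit(par(old))
  mids <- barplot(smry$tc, names.arg = smry$level, col = "grey70",
                  xlab = predictor, ylab = "Total count")
  usr <- par("usr")                        # x range used by the bar plot
  par(new = TRUE)                          # overlay the next plot on the same device
  plot(mids, smry$ratio, type = "b", pch = 16, col = "blue",
       xlim = usr[1:2], xaxs = "i", ylim = c(0, 1),
       axes = FALSE, xlab = "", ylab = "")
  axis(4)
  mtext("Share of ones", side = 4, line = 3)
  invisible(smry)
}

smry <- plot_target(sample_data, "Predictor_1")
print(smry)
```

Calling plot_target(sample_data, "Predictor_2") would reuse the same code for another predictor, since the column name is passed as a string.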

Related

Adding extra track to outside of circos plot (circlize, chordDiagram)

I'm trying to recreate this figure below, where the "to" variable (i.e. target genes) is further grouped into outer (labelled) categories (i.e. receptors).
I have generated some example data, unfortunately I'm not sure what format is needed for the additional outer categories, but it's possibly not far off the link format.
library(circlize)
links <- data.frame(from = c("A", "B", "C", "B", "C"),
to = c("D", "E", "F", "D", "E"),
value = c(1, 1, 1, 1, 1))
categories <- data.frame(from = c("D", "E", "F", "D", "E"),
to = c("X", "X", "Y", "Y", "Y"),
value = c(1, 1, 1, 1, 1))
chordDiagram(links)
Any assistance greatly appreciated!

Bayesian Network Meta-Analysis (gemtc) - Specifying the order of comparisons

I'm working on a Bayesian Network Meta-Analysis using the gemtc package on a dataset similar to the following:
df <- data.frame(study = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F",
"G", "G", "H", "H", "I", "I", "J", "J", "K", "K"),
treatment = c("A", "B", "B", "C", "B", "C", "A", "B", "B",
"C", "B", "C", "A", "B", "B", "C", "B", "C",
"A", "C", "B", "C"),
responders = c(1, 5, 0, 0, 3, 1, 0, 2, 0, 2, 0, 2, 0,
0, 1, 2, 0, 0, 2, 9, 1, 1),
sampleSize = c(32, 33, 30, 30, 18, 20, 15, 15, 20,
20, 30, 30, 36, 32, 15, 15, 23, 22, 24, 23, 18, 16))
While I have been able to set up the network model and run the analysis just fine, I have been struggling with specifying the order in which I would like the treatments to be compared in the node-splitting consistency analysis. For example, I want the odds ratios and 95% credible intervals to be calculated using the "B" treatment as the reference group when comparing "B" with "A" and "C" as the reference group when comparing "A" with "C" and "B" with "C". Below is the code I have tried:
library(gemtc)
library(rjags)
# Create mtc.network element to be used in modeling ------
network <- mtc.network(data.ab = df)
# Compile model ------
network.mod <- mtc.model(network,
linearModel = "random", # random effects model
n.chain = 4) # 4 Markov chains
# Assess network consistency using nodesplit method ------
nodesplit <- mtc.nodesplit(network.mod,
linearModel = "random", # random effects model
n.adapt = 5000, # burn-in iterations
n.iter = 100000, # actual simulation iterations
thin = 10) # extract values of every 10th iteration
summary(nodesplit) # High p-values indicate consistent results
plot(summary(nodesplit))
My results provide ORs (95% CrIs) for:
"A" vs. "C"
"B" vs. "C"
"B" vs. "A"
I have created a separate data frame specifying that I want "A" vs. "B" comparisons via:
# Specify desired comparisons ------
comparisons = data.frame(t1 = "A", t2 = "B")
# Assess network consistency using nodesplit method, adding comparisons argument ------
nodesplit <- mtc.nodesplit(network.mod,
comparisons = comparisons,
linearModel = "random", # random effects model
n.adapt = 5000, # burn-in iterations
n.iter = 100000, # actual simulation iterations
thin = 10) # extract values of every 10th iteration
summary(nodesplit) # High p-values indicate consistent results
But I still get "B" vs. "A" results. I have also tried specifying t1="B", t2="A", and I get the same results. Any assistance with this would be greatly appreciated. Thanks in advance.

rgl segments3d to connect 3d scatter points in order to plot a skeleton

I am working with motion capture data, and wish to plot two skeletons in 3D (motion capture data obtained from two different systems).
I have managed to plot and label the joints, but I can't figure out how to connect the joints with lines.
A short explanation of the abbreviations used in the sample dataset below:
RA and LA (Right and Left Ankle)
RK and LK (Right and Left Knee)
RH and LH (Right and Left Hip)
CG (Center of Gravity)
Simplified data set:
df <- data.frame(
Joint = c("LA", "RA", "LK", "RK", "LH", "RH", "CG", "LA", "RA", "LK", "RK", "LH", "RH", "CG"),
system = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"),
x = c(0, 10, 0, 10, 0, 10, 5, 0, 10, 0, 10, 0, 10, 5),
y = c(0,0,0,0,0,0,0, 20,20,20,20,20,20,20),
z = c(0, 0, 20, 20, 40, 40, 50, 0, 0, 20, 20, 40, 40, 50))
My code so far to plot and label the joints from the two systems:
library(rgl)
with(df, plot3d(x, y, z, type="s", col = as.numeric(system)))
with(df, text3d(x, y, z, text = Joint, adj = 2))
Can you help me connect the joints?
Use the segments3d function to draw line segments. It takes the usual x, y, z coordinates, and joins pairs of points. So you'll need to work out which joints are joined, and plot segments between those joints.
If the joints are always in the order you gave, it would go something like this:
segs <- c(1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 7)
segments3d(df[segs, 3:5])
(This just does the system A segments.)
Edited to add: In response to the first comment: You will need to tell R that ankles connect to knees, etc, but you can do that. For example:
segs <- c()
for (s in unique(df$system)) {
  # row indices of the two joints to connect within system s
  seg <- with(df, c(which(system == s & Joint == "LA"),
                    which(system == s & Joint == "LK")))
  if (length(seg) == 2)
    segs <- c(segs, seg)
  seg <- with(df, c(which(system == s & Joint == "LK"),
                    which(system == s & Joint == "CG")))
  if (length(seg) == 2)
    segs <- c(segs, seg)
  # etc for the other side
}
segments3d(df[segs, 3:5])
This could all be compressed if you have the connections arranged in an R object somehow. I'll leave that to you to work out.
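As a sketch of the compression mentioned above, the connections can be held in a two-column matrix of joint names and turned into row indices with match(). The bones matrix and the segment_rows helper are hypothetical names, and the skeleton layout (ankle to knee, knee to hip, hip to CG) is an assumption taken from the first segs vector.

```r
# Simplified data set from the question.
df <- data.frame(
  Joint  = rep(c("LA", "RA", "LK", "RK", "LH", "RH", "CG"), times = 2),
  system = rep(c("A", "B"), each = 7),
  x = rep(c(0, 10, 0, 10, 0, 10, 5), times = 2),
  y = rep(c(0, 20), each = 7),
  z = rep(c(0, 0, 20, 20, 40, 40, 50), times = 2))

# Assumed skeleton: each row names the two joints that one segment connects.
bones <- rbind(c("LA", "LK"), c("RA", "RK"),
               c("LK", "LH"), c("RK", "RH"),
               c("LH", "CG"), c("RH", "CG"))

# Hypothetical helper: row indices in the pairwise order segments3d expects.
segment_rows <- function(data, bones) {
  rows <- integer(0)
  for (s in unique(data$system)) {
    idx  <- which(data$system == s)
    from <- idx[match(bones[, 1], data$Joint[idx])]
    to   <- idx[match(bones[, 2], data$Joint[idx])]
    ok   <- !is.na(from) & !is.na(to)      # drop bones with a missing joint
    rows <- c(rows, as.vector(rbind(from[ok], to[ok])))
  }
  rows
}

segs <- segment_rows(df, bones)

# Drawing needs an rgl device, so it only runs in an interactive session.
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  with(df, rgl::plot3d(x, y, z, type = "s", col = as.numeric(factor(system))))
  rgl::segments3d(df[segs, c("x", "y", "z")])
}
```

For system A this reproduces the hand-written index vector, and the same bones matrix is reused for system B.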

Random stratified sampling with different proportions

I am trying to split a dataset in 80/20 - training and testing sets. I am trying to split by location, which is a factor with 4 levels, however each level has not been sampled equally. Out of 1892 samples -
Location1: 172
Location2: 615
Location3: 603
Location4: 502
I am trying to split the whole dataset 80/20, as mentioned above, but I also want each location to be split 80/20 so that I get an even proportion from each location in the training and testing set. I've seen one post about this using stratified function from the splitstackshape package but it doesn't seem to want to split my factors up.
Here is a simplified reproducible example -
x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)
validIndex <- stratified(df, "xx", size=16/nrow(df))
valid <- df[-validIndex,]
train <- df[validIndex,]
where A, B, C, D correspond to the factor levels, in approximately the same proportions as in the actual dataset (~10, 32, 32, and 26%, respectively).
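Using bothSets = TRUE should return a list containing the split of the original data frame into the sampled set and the remainder (whose union should be the original data frame). With size = 16/nrow(df), i.e. 0.8, the first element holds the sampled rows (about 80% of each group, the training set) and the second holds the rest (the validation set):
library(splitstackshape)
splt <- stratified(df, "xx", size=16/nrow(df), replace=FALSE, bothSets=TRUE)
train <- splt[[1]]
valid <- splt[[2]]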
## check
df2 <- as.data.frame(do.call("rbind",splt))
all.equal(df[with(df, order(xx, x)), ],
df2[with(df2, order(xx, x)), ],
check.names=FALSE)
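For comparison, the same per-group split can be sketched in base R without splitstackshape. stratified_split is a hypothetical helper, not part of any package. It samples round(prop * n) rows within each level, so very small groups (such as A, with two rows here) can end up entirely in the training set.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

x  <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C",
        "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)

# Hypothetical helper: sample `prop` of the rows within each level of `group`.
stratified_split <- function(data, group, prop = 0.8) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  train_rows <- unlist(lapply(rows_by_group, function(rows) {
    # sample.int on the length avoids sample()'s special case for a single number
    rows[sample.int(length(rows), round(prop * length(rows)))]
  }), use.names = FALSE)
  test_rows <- setdiff(seq_len(nrow(data)), train_rows)
  list(train = data[train_rows, ], test = data[test_rows, ])
}

splt  <- stratified_split(df, "xx", prop = 0.8)
train <- splt$train
valid <- splt$test

table(train$xx)  # rows per level in the training set
table(valid$xx)  # rows per level in the validation set
```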

Dealing with ties in agricolae Kruskal test, R

I am running a Kruskal test on some non-normal data with the agricolae package. Some groups have exactly the same value as each other. The test doesn't handle this well: I receive the error Error in if (s) { : missing value where TRUE/FALSE needed. At first, I thought this was because all the values were 0, but when I make them all the same large number (to test), the same error appears. The function stops (I am running it in a loop) and doesn't evaluate anything beyond the first tied variable.
Obviously there is no point running stats on these groups, as there will be no difference, but I am using the information generated by agricolae::kruskal to produce a summary table and I need these variables included. I would prefer to keep using this package as it gives me a lot of valuable information. Is there anything I can do to make it run through the tied variables?
dput(example)
structure(list(TREATMENT = c("A", "A", "A", "B", "B", "C", "C",
"C", "D", "D"), W = c(0, 1.6941524646937, 1.524431531984, 0.959282869723864,
1.45273122733115, 0, 1.57479386520925, 0.421759202661462, 1.34235435984449,
1.52131484305823), X = c(0, 0.663872820198758, 0.202935807030853,
0.836223346381214, 0.750767193777965, 1.18128574225979, 2.03622986392828,
3.56466682539425, 0.919751117364462, 0.917347336682722), Y = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Z = c(2.1477548118197, 2.0111754022729,
3.14642815196242, 4.46967452127494, 1.53715421615569, 2.36274861406182,
2.33262528044302, 2.50970456594739, 2.96088598025103, 2.22841740590261
)), class = "data.frame", row.names = c(NA, 10L), .Names = c("TREATMENT",
"W", "X", "Y", "Z"))
library(agricolae)
example<-as.data.frame(example)
for(i in 2:(ncol(example))){
krusk <- kruskal(example[,i],TREATMENT,group=TRUE)
print(krusk)
}
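One workaround is to skip any column whose values are all identical (zero variance), so that kruskal is only called where a test is meaningful:
for(i in 2:(ncol(example))){
  if(var(example[,i]) > 0){
    krusk <- kruskal(example[,i],example$TREATMENT,group=TRUE)
    print(krusk)
  }
}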
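If the constant columns still need a row in the summary table, one option is to record NA for them instead of dropping them. The sketch below uses base R's kruskal.test rather than agricolae::kruskal, so it shows only the pattern and not the agricolae output. The data are the W and Y columns of the example rounded to three decimals, and summary_table is a hypothetical name.

```r
# W and Y from the question's example, rounded to three decimals.
example <- data.frame(
  TREATMENT = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  W = c(0, 1.694, 1.524, 0.959, 1.453, 0, 1.575, 0.422, 1.342, 1.521),
  Y = rep(0, 10))

results <- lapply(names(example)[-1], function(v) {
  y <- example[[v]]
  if (var(y) > 0) {
    # Base R test, swapped in here for agricolae::kruskal.
    kt <- kruskal.test(y, factor(example$TREATMENT))
    data.frame(variable = v,
               statistic = unname(kt$statistic),
               p.value = kt$p.value)
  } else {
    # Constant column: keep a placeholder row so it still appears in the table.
    data.frame(variable = v, statistic = NA_real_, p.value = NA_real_)
  }
})

summary_table <- do.call(rbind, results)
print(summary_table)
```

The same if/else could wrap the agricolae call, with the else branch building whatever placeholder row matches the columns of the real summary table.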
