Multiple barplot along with t-test - r

I want a barplot based on the number of occurrences of a string in a particular column of a dataset in R.
At the same time, I want to run a t-test and mark the significant p-values with stars on top of the bars. Nonsignificant results can be shown as ns.
My attempt has been:
barplot(prop.table(table(ttcluster_dataset$Phenotype)),col=clustercolor,border="black",xlab="Phenotypes",ylab="Percentage of Samples expressed",main="Sample wise Phenotype distribution",cex.names = 0.8)
The dataset column is:
ttcluster_dataset$Phenotype<-
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("Proneural (Cluster 1)", "Proneural (Cluster 2)", "Neural (Cluster 1)", "Neural (Cluster 2)",
"Classical (Cluster 1)", "Classical (Cluster 2)", "Mesenchymal (Cluster 1)",
"Mesenchymal (Cluster 2)"), class = "factor")
All suggestions are appreciated.

A t-test is probably not what you want, since you are comparing counts and proportions between the two clusters. Your data are not yet set up for either analysis, so first we need to split the combined label into two variables:
Pheno.splt <- strsplit(as.character(ttcluster_dataset$Phenotype), " ")
Pheno.mat <- do.call(rbind, Pheno.splt)[, c(1, 3)]
ttclust <- data.frame(Phenotype=Pheno.mat[, 1], Cluster=gsub(")", "", Pheno.mat[, 2], fixed=TRUE))
str(ttclust)
# 'data.frame': 171 obs. of 2 variables:
# $ Phenotype: chr "Proneural" "Proneural" "Proneural" "Proneural" ...
# $ Cluster : chr "1" "1" "1" "1" ...
There are multiple ways to do this; here we simply split each Phenotype label into three parts on the spaces, keep the first and third parts, and strip the closing parenthesis. Now ttclust is a data frame with Phenotype and Cluster as separate columns. Next, a summary table and bar plot:
tbl <- xtabs(~Phenotype+Cluster, ttclust)
tbl
#              Cluster
# Phenotype      1  2
#   Classical   32  6
#   Mesenchymal 44 10
#   Neural      26  0
#   Proneural   45  8
tbl.row <- prop.table(tbl, 1)
barplot(t(tbl.row), beside=TRUE)
At this point, a simple proportions test indicates that there is no significant difference in the percentage of Cluster 1 across the four Phenotypes:
prop.test(tbl)
# 4-sample test for equality of proportions without continuity correction
#
# data: tbl
# X-squared = 5.2908, df = 3, p-value = 0.1517
# alternative hypothesis: two.sided
# sample estimates:
#    prop 1    prop 2    prop 3    prop 4
# 0.8421053 0.8148148 1.0000000 0.8490566
Using prop.test on each Phenotype separately indicates that the proportion in Cluster 1 is significantly different from that in Cluster 2 in every case:
for(i in 1:4) print(prop.test(t(tbl[i, ])))
# First test
#
# 1-sample proportions test with continuity correction
#
# data: t(tbl[i, ]), null probability 0.5
# X-squared = 16.447, df = 1, p-value = 5.002e-05
# alternative hypothesis: true p is not equal to 0.5
# 95 percent confidence interval:
# 0.6807208 0.9341311
# sample estimates:
# p
# 0.8421053
. . . .
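To put stars on the bars as the question asked, one possible sketch (not part of the original answer) runs the one-sample proportion test for each Phenotype, converts the p-values to symbols with cut(), and writes them above each pair of bars using the midpoints that barplot() returns. The counts are retyped from the table above; the cutoffs 0.001, 0.01 and 0.05 and the names pvals, stars and mids are assumptions chosen for illustration.

```r
# Counts retyped from the xtabs output above (columns: Cluster 1, Cluster 2)
tbl <- matrix(c(32, 44, 26, 45,
                 6, 10,  0,  8), ncol = 2,
              dimnames = list(Phenotype = c("Classical", "Mesenchymal",
                                            "Neural", "Proneural"),
                              Cluster = c("1", "2")))

# One-sample proportion test per Phenotype (H0: share of Cluster 1 is 0.5)
pvals <- apply(tbl, 1, function(r) prop.test(r[1], sum(r))$p.value)

# Assumed cutoffs: *** <= 0.001, ** <= 0.01, * <= 0.05, otherwise ns
stars <- as.character(cut(pvals,
                          breaks = c(0, 0.001, 0.01, 0.05, 1),
                          labels = c("***", "**", "*", "ns"),
                          include.lowest = TRUE))

# Grouped bar plot of row proportions; barplot() returns the bar midpoints
tbl.row <- prop.table(tbl, 1)
mids <- barplot(t(tbl.row), beside = TRUE, ylim = c(0, 1.2),
                ylab = "Proportion of samples",
                legend.text = c("Cluster 1", "Cluster 2"))

# One label per Phenotype, centred over its pair of bars
text(x = colMeans(mids), y = apply(tbl.row, 1, max) + 0.05, labels = stars)
```

Each label reflects the within-Phenotype comparison of Cluster 1 against Cluster 2, not a comparison between Phenotypes.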

Related

How to rename integers to factor based on key

I need to make a new column of factors based on the value of the column Quadrat. There are 9 quadrats, and the new column, called Sponge, would be something like:
"Old Growth" if Quadrat = 1,4,9
"Absent" if Quadrat= 3,6,7
"New Growth" if Quadrat = 2,5,8
I am sorry if the answer is easy. I did check: How to convert integer to factor in R?
I am also trying to use recode_factor. Here is my code:
library(dplyr)
key <- list(`1,4,9` = "Old Growth", `3,6,7` = "Absent", `2,5,8` = "New Growth")
df <- mutate(df, Sponge = recode_factor(Quadrat, key))
I get error:
Error in mutate_impl(.data, dots) :
Evaluation error: Vector 1 must be length 108 or one, not 3.
Real data has much more entries than the dataset I include here, if that matters. Thank you for any help.
df <- structure(list(Quadrat = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L,
9L, 9L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L,
5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 1L, 1L, 1L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L,
7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L,
9L, 9L, 9L), Month = structure(c(4L, 4L, 4L, 3L, 3L, 3L, 7L,
7L, 7L, 1L, 1L, 1L, 8L, 8L, 8L, 6L, 6L, 6L, 5L, 5L, 5L, 2L, 2L,
2L, 9L, 9L, 9L, 4L, 4L, 4L, 3L, 3L, 3L, 7L, 7L, 7L, 1L, 1L, 1L,
8L, 8L, 8L, 6L, 6L, 6L, 5L, 5L, 5L, 2L, 2L, 2L, 9L, 9L, 9L, 4L,
4L, 4L, 3L, 3L, 3L, 7L, 7L, 7L, 1L, 1L, 1L, 8L, 8L, 8L, 6L, 6L,
6L, 5L, 5L, 5L, 2L, 2L, 2L, 9L, 9L, 9L, 4L, 4L, 4L, 3L, 3L, 3L,
7L, 7L, 7L, 1L, 1L, 1L, 8L, 8L, 8L, 6L, 6L, 6L, 5L, 5L, 5L, 2L,
2L, 2L, 9L, 9L, 9L), .Label = c("Apr", "Aug", "Feb", "Jan", "Jul",
"Jun", "Mar", "May", "Sep"), class = "factor"), PopDens = c(65.6011820777785,
18.4913752602879, 12.151802276494, 68.0740840677172, 50.9832500135526,
36.8684287818614, 52.0825074084569, 26.8776902493555, 49.2173263626173,
25.5460870559327, 5.4171769618988, 34.4303709487431, 44.3439512783661,
2.25230997451581, 61.2502326716203, 25.9035727053415, 32.339118222706,
24.1017888628412, 12.340617884649, 53.3521768709179, 26.0048255382571,
52.8581868957262, 31.9503199581522, 18.1601244299673, 34.228305231547,
2.09199664392509, 22.6402857622597, 4.48008164577186, 48.2082461479586,
65.4937081446406, 5.43837511213496, 32.8203339113388, 4.44421968702227,
19.8568186087068, 24.2561273102183, 12.3652934685815, 39.0541164302267,
16.1970243314281, 12.9826903613284, 36.3537323835772, 48.7148000504822,
11.5067498446442, 68.7493303583469, 60.7505214684643, 49.3874175737146,
63.0705459746532, 23.721419940237, 53.4379795142449, 57.7867246468086,
38.4747762591578, 8.43540686019696, 20.5636212413665, 28.7687741059344,
53.2144687068649, 32.0859562589321, 10.5120962983929, 53.4312571119517,
13.6547974413261, 31.3038802060764, 14.5005466006696, 6.03453303268179,
62.6867637028918, 17.7734197168611, 11.0327071261127, 51.4377708046231,
26.8335341704078, 9.81126144807786, 43.993699422339, 20.5123583010864,
14.9305799969006, 23.8019575944636, 39.1543961388525, 30.4534046472982,
61.2751477411948, 48.0770866076928, 59.4514226955362, 42.9857548968866,
23.0139948409051, 1.76873184926808, 33.1222371393815, 10.8652087603696,
24.5235243474599, 62.4086231633555, 55.6522683221847, 68.8337469024118,
48.2195318546146, 6.75986870843917, 57.7931131315418, 18.2255988919642,
40.8185531077906, 38.066848333925, 31.8611310839187, 22.2724406518973,
51.7982920755167, 29.2363496678881, 35.541056742426, 66.5265460675582,
28.267403066624, 40.5209824540652, 31.8187582066748, 67.2972998009063,
53.6718824433628, 42.6495425191242, 31.6603209995665, 44.3039192620199,
21.6216275517363, 66.9763269643299, 36.3314134527463)), .Names = c("Quadrat",
"Month", "PopDens"), row.names = c(NA, -108L), class = "data.frame")
If we are using recode_factor, create the list with one component per value instead of the pasted-together names:
key <- setNames(as.list(rep(c("Old Growth", "Absent", "New Growth"),
each = 3)), c(1, 4, 9, 3, 6, 7, 2, 5, 8))
df %>%
mutate(Sponge = recode_factor(Quadrat, !!! key)) %>%
head
# Quadrat Month PopDens Sponge
#1 1 Jan 65.60118 Old Growth
#2 1 Jan 18.49138 Old Growth
#3 1 Jan 12.15180 Old Growth
#4 2 Feb 68.07408 New Growth
#5 2 Feb 50.98325 New Growth
#6 2 Feb 36.86843 New Growth
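Alternatively, use mutate with the factor function. The labels must follow the order of levels 1 to 9, so that quadrats 1, 4, 9 become "Old Growth", 2, 5, 8 become "New Growth", and 3, 6, 7 become "Absent":
df %>% mutate(Sponge =
factor(Quadrat, levels = 1:9,
labels = c("Old Growth", "New Growth", "Absent", "Old Growth", "New Growth", "Absent", "Absent", "New Growth", "Old Growth")
)
)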
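The same mapping can also be sketched in base R with a named lookup vector, which keeps the key readable when there are many values. This is an illustrative alternative, not taken from the answers above; the name lookup and the one-row-per-quadrat toy data frame are assumptions.

```r
# Toy data: one row per quadrat (the real df has 108 rows)
df <- data.frame(Quadrat = 1:9)

# Named lookup vector: names are quadrat numbers, values are sponge classes
lookup <- c("1" = "Old Growth", "4" = "Old Growth", "9" = "Old Growth",
            "3" = "Absent",     "6" = "Absent",     "7" = "Absent",
            "2" = "New Growth", "5" = "New Growth", "8" = "New Growth")

# Index the lookup by the quadrat number (as character), then build the factor
df$Sponge <- factor(unname(lookup[as.character(df$Quadrat)]),
                    levels = c("Old Growth", "Absent", "New Growth"))

table(df$Quadrat, df$Sponge)
```

The cross-table at the end is a quick check that every quadrat landed in the intended class.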

Sankey diagram, alluvial, ggalluvial in R – Three data blocks: Baseline-Flow (many time points)-Outcome

We would like to present the change in muscle mass due to exercise in different age groups, together with the final performance/outcome at the competition at the end of the study.
We have several time points at which the muscle mass was measured. In this example I only show three time points; however, the study comprises 12 time points.
To present the change in muscle mass and deviation from the average I was able to use geom_flow(). However, it becomes very tricky to add the age groups on the left of the chart as well as the performance on the right side. These data are located in different variables.
Please help us to find a great way to present the data. Thanks.
Data Structure:
ID  Age_at_start  Month  Deviation_muscle  Performance
1   36            3      59                Outstanding
1   36            6      104               Outstanding
1   36            9      200               Outstanding
2   29            3      -40               average
2   29            6      -109              average
2   29            9      -30               average
3   22            3      310               above average
library(ggplot2)
library(ggalluvial)
df.san$age<-factor(df.san$age)
df.san$age<-factor(df.san$age, levels=c(1,2,3,4), labels=c("20 to 24 years","25 to 29 years","30 to 34 years","35 to 39 years"))
df.san$dev_group <-factor(df.san$dev_group,levels=c(1,2,3,4,5,6,7),labels=c("≥250g","≥150 to <250g","≥50 to <150g","> -50 to <50g","> -150 to ≤ -50","> -250 to ≤ -150", "≤ -250g"))
df.san$month <- factor(df.san$month,labels=c("1mo","2mo","3mo"))
df.san$perform<-factor(df.san$perform,levels=c(1,2,3,4),labels=c("outstanding "," above average "," average "," below average"))
ggplot(df.san,aes(x = month,stratum = dev_group, alluvium = ID, fill = dev_group,label = dev_group)) +
scale_fill_brewer(type = "qual", palette = "Set2") +
geom_flow(stat = "alluvium", lode.guidance = "rightleft", color = "darkgray") +
geom_stratum() +
theme(legend.position = "bottom") +
ggtitle("Effect of Exercise on Muscle Growth on Performance in 4 Different Age Groups")
Data for df.san:
structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 13L, 13L, 13L, 14L, 14L, 14L, 15L, 15L, 15L), age = c(2L, 3L, 3L, 1L, 3L, 1L, 2L, 3L, 4L, 1L, 1L, 3L, 1L, 4L, 4L, 3L, 4L, 3L, 4L, 2L, 2L, 1L, 2L, 4L, 1L, 1L, 4L, 1L, 3L, 1L, 2L, 3L, 4L, 4L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 4L, 3L, 3L, 2L), month = c(2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L, 2L, 4L, 6L), dev_muscle = c(-109.3, -236.2, -275.4, -44.5, -202.6, -436, 3, -115.8, -136.2, -142.1, -429, -561.4, -49, -248.8, -232.6, -15.9, -171.5, -391.6, -5.8, -21.7, -104.1, 12.6, -33.4, -25.4, -57.3, -50.7, -103.6, -124, -221.4, -457.2, 22.1, -126.9, -79.5, -76.8, -113.2, -129.7, -86.1, -126, -82.9, -10.8, -2.8, 88.3, 41.6, 0.2, 184.7), perform = c(1L, 2L, 1L, 2L, 4L, 1L, 1L, 4L, 3L, 4L, 2L, 4L, 4L, 4L, 2L, 2L, 4L, 3L, 3L, 4L, 1L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 2L, 1L, 2L, 4L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 4L, 4L), dev_group = c(5L, 6L, 7L, 4L, 6L, 7L, 4L, 5L, 5L, 5L, 7L, 7L, 4L, 6L, 6L, 4L, 6L, 7L, 4L, 4L, 5L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 7L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 3L, 4L, 4L, 2L)), class = "data.frame", row.names = c(NA, -45L))

How to construct a matrix for a heatmap or a contour plot, but with NA events?

How can I construct a heatmap-like matrix from three variables (two categorical, one numeric) in which certain events do not occur? My dplyr code overlooks those events and misses about 20 cavities in the surface plot that I'd like to make, and for that I need an accurate matrix. This is rather complicated.
What I consider an NA event is a pair of categorical events (Modeling and Discourse) that never occur simultaneously, so there is no maximum time for it: a point with no time observation at all (NA), not even zero.
I have the following dataframe:
df <- structure(list(`Modeling Code` = structure(c(4L, 4L, 4L, 4L,
4L, 4L, 4L, 6L, 4L, 5L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L,
2L, 6L, 6L, 6L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L,
6L, 6L, 6L, 6L, 6L, 4L, 5L, 5L, 5L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 3L, 3L, 5L, 4L, 4L, 4L,
4L, 5L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
4L, 5L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 6L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 2L, 2L,
2L, 5L, 4L, 4L, 2L, 2L, 5L, 2L, 2L, 3L, 5L, 5L, 5L, 4L, 4L, 1L,
1L, 4L, 4L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 6L, 5L, 5L, 2L, 5L, 5L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 6L, 5L, 5L,
5L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 6L, 6L, 6L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 3L, 2L, 2L, 2L, 2L, 2L,
5L, 5L, 5L, 3L, 3L, 3L, 3L, 6L, 6L, 3L, 3L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 5L, 5L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 3L, 3L, 3L, 6L, 6L, 6L, 2L, 2L, 2L, 2L, 6L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 6L, 2L, 6L, 2L, 6L, 6L, 6L, 6L, 2L, 2L, 2L,
2L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 3L,
3L, 3L, 3L, 6L, 6L, 6L, 6L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 6L, 6L, 6L, 6L, 6L,
6L, 1L, 1L, 1L, 1L, 1L, 6L, 6L, 1L, 1L, 1L, 3L, 3L, 1L), .Label = c("A",
"MA", "OFF", "P", "SM", "V"), class = "factor"), `Discourse Code` = structure(c(8L,
5L, 8L, 1L, 9L, 2L, 8L, 6L, 5L, 6L, 5L, 8L, 3L, 3L, 6L, 2L, 2L,
9L, 3L, 3L, 6L, 6L, 3L, 3L, 8L, 6L, 9L, 3L, 3L, 9L, 8L, 6L, 8L,
6L, 9L, 3L, 3L, 6L, 6L, 4L, 9L, 1L, 6L, 9L, 6L, 3L, 3L, 6L, 8L,
2L, 6L, 2L, 8L, 2L, 2L, 2L, 2L, 8L, 2L, 1L, 6L, 8L, 9L, 2L, 6L,
8L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 9L, 1L, 6L, 8L, 7L, 7L, 6L,
8L, 6L, 9L, 9L, 6L, 1L, 1L, 6L, 6L, 9L, 9L, 1L, 1L, 9L, 6L, 6L,
6L, 1L, 1L, 9L, 6L, 9L, 1L, 6L, 1L, 9L, 9L, 1L, 6L, 1L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 6L, 9L, 6L, 9L, 8L, 2L, 8L, 2L, 1L, 2L, 6L,
4L, 1L, 1L, 1L, 9L, 5L, 1L, 9L, 8L, 2L, 9L, 2L, 7L, 6L, 1L, 6L,
1L, 2L, 6L, 6L, 6L, 9L, 2L, 2L, 9L, 7L, 7L, 7L, 7L, 9L, 2L, 1L,
1L, 4L, 8L, 4L, 6L, 1L, 6L, 9L, 2L, 1L, 9L, 6L, 6L, 9L, 1L, 6L,
2L, 4L, 4L, 4L, 4L, 8L, 6L, 2L, 1L, 1L, 1L, 2L, 6L, 6L, 8L, 2L,
4L, 6L, 9L, 1L, 6L, 1L, 1L, 3L, 2L, 2L, 2L, 9L, 9L, 9L, 8L, 2L,
6L, 1L, 2L, 1L, 2L, 2L, 1L, 8L, 2L, 6L, 6L, 8L, 2L, 7L, 2L, 2L,
6L, 2L, 2L, 6L, 4L, 8L, 7L, 7L, 7L, 7L, 6L, 8L, 7L, 7L, 9L, 1L,
9L, 2L, 9L, 1L, 6L, 9L, 2L, 6L, 2L, 7L, 9L, 8L, 9L, 9L, 2L, 8L,
9L, 4L, 2L, 4L, 6L, 2L, 6L, 1L, 1L, 3L, 9L, 1L, 8L, 9L, 9L, 9L,
6L, 2L, 6L, 2L, 2L, 7L, 7L, 7L, 8L, 1L, 2L, 2L, 2L, 2L, 6L, 8L,
6L, 1L, 6L, 8L, 2L, 1L, 2L, 6L, 9L, 2L, 9L, 2L, 6L, 2L, 1L, 1L,
9L, 9L, 9L, 8L, 4L, 9L, 6L, 1L, 2L, 9L, 8L, 2L, 1L, 6L, 1L, 6L,
2L, 8L, 2L, 2L, 8L, 4L, 4L, 9L, 6L, 1L, 9L, 7L, 7L, 7L, 7L, 7L,
9L, 6L, 7L, 7L, 7L, 7L, 8L, 6L, 2L, 2L, 6L, 8L, 8L, 4L, 2L, 6L,
1L, 6L, 9L, 6L, 9L, 9L, 2L, 8L, 6L, 6L, 2L, 2L, 9L, 9L, 6L, 2L,
2L, 3L, 3L, 3L, 2L, 9L, 2L, 9L, 2L, 9L, 1L, 9L, 8L, 6L, 7L, 7L,
6L), .Label = c("AG", "C", "D", "DA", "G", "J", "OFF", "Q", "S"
), class = "factor"), Time_Processed = c(1.3833, 1.4333, 1.4667,
1.5333, 1.6167, 1.65, 1.6833, 1.7333, 1.8, 1.8667, 1.9833, 2.05,
2.1333, 2.1667, 2.2167, 2.3, 2.3167, 2.3667, 2.5667, 2.5833,
2.6, 2.7833, 2.8, 2.8167, 2.8667, 3.0167, 3.0333, 3.05, 3.05,
3.1, 3.1833, 3.2667, 3.3, 3.3333, 3.4167, 3.45, 3.4833, 3.5667,
3.6, 3.7, 3.7167, 3.8, 3.95, 4, 4.05, 4.15, 4.1667, 4.15, 4.2167,
4.3, 4.3833, 4.4, 4.4833, 4.5833, 4.6, 4.7, 4.8, 4.8333, 4.8833,
5, 5.05, 5.1, 5.2167, 5.4333, 5.45, 5.6, 5.7, 5.9167, 6.25, 6.2667,
6.2833, 6.4667, 6.5167, 6.5333, 6.55, 6.6667, 6.7167, 6.9, 6.95,
7.05, 7.05, 7.45, 7.6167, 7.7667, 7.7833, 7.8333, 8, 8.0167,
8.05, 8.1, 8.2833, 8.3167, 8.4333, 8.4667, 8.5, 8.55, 8.8833,
9.2667, 9.3167, 9.3333, 9.35, 9.5167, 9.6833, 9.7167, 9.7667,
9.7833, 9.8333, 9.9, 9.9667, 10.0667, 10.0833, 10.15, 10.2, 10.2667,
10.2667, 10.3, 10.35, 10.3667, 10.4, 10.7, 10.7833, 10.9, 11.1333,
11.1833, 11.2167, 11.2333, 11.25, 11.3, 11.35, 11.4167, 11.4667,
11.5333, 11.5667, 11.6667, 11.85, 11.8667, 11.8833, 12.25, 12.3167,
12.7167, 12.7333, 12.8, 12.85, 12.9333, 12.9667, 13.2667, 13.3167,
13.4, 13.4167, 13.5, 13.55, 13.6333, 13.9, 13.95, 13.9667, 14.05,
14.0833, 14.3167, 14.35, 14.3667, 14.4333, 14.4667, 14.5, 14.5333,
14.5833, 14.5833, 14.6167, 14.6667, 14.7167, 14.75, 14.7667,
15.05, 15.0833, 15.25, 15.4333, 15.4833, 15.5167, 15.6, 15.6333,
15.7167, 15.7333, 15.7667, 15.8667, 16.0167, 16.2, 16.2833, 16.3333,
16.3833, 16.45, 16.6, 16.6667, 16.9333, 16.9667, 17, 17.0333,
17.0833, 17.1167, 17.2167, 17.35, 17.4333, 17.55, 17.6, 17.6167,
17.65, 17.7, 17.7167, 17.75, 17.7833, 17.8833, 17.9333, 17.9833,
18.0167, 18.0333, 18.05, 18.0667, 18.1, 18.1667, 18.2, 18.3667,
18.45, 18.5333, 18.6333, 18.6667, 18.7333, 18.85, 18.8833, 18.9833,
19.0333, 19.0667, 19.3833, 19.5333, 19.6333, 19.6667, 19.7167,
19.9333, 19.9667, 20.05, 20.2333, 20.3667, 20.4333, 20.5, 20.5167,
20.5167, 20.55, 20.6167, 20.7167, 20.7667, 20.8167, 20.8667,
21.1333, 21.1833, 21.2, 21.2167, 21.2333, 21.2833, 21.3, 21.5,
21.5833, 21.6333, 21.6667, 21.6833, 21.6833, 21.8167, 21.8833,
22.1333, 22.1667, 22.35, 22.4333, 22.5, 22.5333, 22.5833, 22.6,
22.6, 22.65, 22.6667, 22.7167, 22.75, 22.8833, 23.0667, 23.0833,
23.1167, 23.3167, 23.35, 23.3667, 23.45, 23.5, 23.7667, 23.9833,
24.1833, 24.2167, 24.25, 24.2833, 24.5167, 24.5333, 24.6833,
24.7833, 24.7833, 24.8, 24.8, 24.8667, 25.3833, 25.4333, 25.4833,
25.5, 25.5167, 25.55, 25.5667, 25.5833, 25.6667, 25.7, 26, 26.1333,
26.1667, 26.2, 26.2333, 26.2667, 26.4, 26.4333, 26.4667, 26.5,
26.5167, 26.6667, 26.7, 26.8, 27.0833, 27.1833, 27.2, 27.2, 27.45,
27.5667, 27.6667, 27.7, 27.75, 27.7667, 27.7667, 27.8, 27.8333,
28.0333, 28.35, 28.6333, 28.6333, 28.7833, 28.8, 28.85, 29, 29.1833,
29.3333, 29.6667, 29.7333, 29.8, 29.8833, 29.9, 29.9333, 30.0667,
30.1, 30.1833, 30.2167, 30.25, 30.3, 30.3833, 30.5, 30.55, 30.7167,
31.0167, 31.45, 31.6, 31.8, 31.8333, 32.0167, 32.15, 32.15, 32.1667,
32.2167, 32.2167, 32.2333, 32.3833, 32.6167, 32.6667, 32.7, 32.7167,
32.7333, 32.75, 32.9, 33.0833, 33.1333, 33.1833)), row.names = c(NA,
-386L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Modeling Code",
"Discourse Code", "Time_Processed"))
Looks a little bit like this:
df[1:10,]
# A tibble: 10 x 3
`Modeling Code` `Discourse Code` Time_Processed
<fct> <fct> <dbl>
1 P Q 1.38
2 P G 1.43
3 P Q 1.47
4 P AG 1.53
5 P S 1.62
6 P C 1.65
7 P Q 1.68
8 V J 1.73
9 P G 1.80
10 SM J 1.87
If I construct a matrix for my heatmap for the two categorical variables Modeling Code and Discourse Code, it looks a little bit like this:
with(df, table(`Discourse Code`, `Modeling Code`)) %>% prop.table() %>% as.data.frame() -> z
ggplot(data = z, aes(x = `Modeling.Code`, y = `Discourse.Code`, fill = Freq)) + theme_bw() + geom_tile() + geom_text(size = 3, aes(label = Freq))
This is a heatmap of the frequency of occurrence of each pair of categorical variables, so (C & MA) occur simultaneously about 10.6% of the time, while many pairs of categorical factors do not occur simultaneously at all. These are the ones with a value of 0. All the cells add up to 1, accounting for 100% of all pairs of Modeling and Discourse Codes.
If you count the number of zeroes (non-occurring pairs) in this dataset, you will see that there are twenty of them, and this is important.
I was interested in the times at which these pairs occur so I decided to make a contour plot with plot_ly from my original dataset.
plot_ly(data = df, x = ~ `Modeling Code`, y = ~ `Discourse Code`, z = ~ `Time_Processed`, type = "contour")
Inspecting this contour plot interactively with the mouse shows that the plotted values of "Time_Processed" are the maximum values for each pair of "Modeling Code" and "Discourse Code".
So I generate those points with dplyr:
df %>%
group_by(`Modeling Code`, `Discourse Code`) %>%
summarise(max_time = max(Time_Processed))
# A tibble: 34 x 3
# Groups: Modeling Code [?]
`Modeling Code` `Discourse Code` max_time
<fct> <fct> <dbl>
1 A AG 9.97
2 A C 32.7
3 A D 4.17
4 A J 33.2
5 A Q 32.8
6 A S 32.7
7 MA AG 24.7
8 MA C 31.4
9 MA D 22.4
10 MA DA 27.2
# ... with 24 more rows
Hold up!!! There are only 34 entries, of maximum times, but the size of my heatmap is (6 x 9) = 54 cells. The 20 missing entries are the categorical pairs that yield zero. So I'm finding it very difficult to construct my matrix.
        A        MA       OFF     P        SM       V
S       32.733   31.800   NA      30.3000  30.250   32.700
Q       32.750   27.1833  NA      30.5000  29.800   28.85
OFF     NA       NA       33.133  NA       NA       NA
J       33.1833  26.5167  NA      30.7167  30.2167  31.8333
G       NA       NA       NA      11.8500  NA       NA
DA      NA       20.72    NA      NA       29.8833  25.700
D       4.1667   22.235   NA      6.2667   NA       32.2167
C       32.6667  31.4500  NA      30.3833  29.9000  32.1500
AG      9.967    24.6833  NA      13.2667  30.0667  32.7167
This is the matrix (assuming I didn't make any manual mistakes) that I'd like to create based on my observations. The NAs mark the Modeling and Discourse Code pairs that do not occur, that is, the 20 entries that my dplyr summarise call with the maximum time could not capture but my heatmap did. I could fill out this matrix by hand, but that is tedious.
My question is how can I construct this matrix?
In addition, I would prefer that the missing values show up either as NA or as -1, but not as zero, because my goal is to construct this matrix and then create a 3D surface plot that complements my contour plot, so that I can accurately see the types of procedures my subjects implement over an event that lasts about 30 minutes. If the missing cells are interpreted as zero, the surface plot will be wrong, because at the beginning of the event (time 0) the subjects did not use those procedures.
Complex problems sometimes have simple solutions, and this was not clear to me until I had experimented with many existing functions. It turns out that dcast (here from the reshape2 package) accomplishes my goal; FERMI_1 is the name of my full data frame (df in the question). The long explanation above was my attempt to convey the complexity of the problem.
dcast(data = FERMI_1, formula = `Discourse Code` ~ `Modeling Code`, value.var = "Time_Processed", fun.aggregate = max, fill = -1)
Discourse Code A MA OFF P SM V
1 AG 9.9667 24.6833 -1.0000 13.2667 30.0667 32.7167
2 C 32.6667 31.4500 -1.0000 30.3833 29.9000 32.1500
3 D 4.1667 22.3500 -1.0000 6.2667 -1.0000 32.2167
4 DA -1.0000 27.2000 -1.0000 -1.0000 29.8833 25.7000
5 G -1.0000 -1.0000 -1.0000 11.8500 -1.0000 -1.0000
6 J 33.1833 26.5167 -1.0000 30.7167 30.2167 31.8333
7 OFF -1.0000 -1.0000 33.1333 -1.0000 -1.0000 -1.0000
8 Q 32.7500 27.1833 -1.0000 30.5000 29.8000 28.8500
9 S 32.7333 31.8000 -1.0000 30.3000 30.2500 32.7000
It appears my comment answered the question:
If you have an object that supports the is.na and [<- functions, then reassigning a numeric value of -1 to entries that are currently NA is as simple as obj[ is.na(obj) ] <- -1. (I cannot really tell if this is the request, since I got lost in the long presentation that didn't have a definite goal.) If, on the other hand, the need is to first generate such a matrix from a long-format data object named df2, that might be addressed by the call below. Note that the non-syntactic column names need backticks, and that xtabs fills pairs that never occur with 0 rather than NA, so those cells would still need to be recoded.
obj <- xtabs(max_time ~ `Modeling Code` + `Discourse Code`, data=df2)
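A base R sketch that yields NA for pairs that never occur is tapply(), which leaves empty factor combinations as NA. The toy data frame below and its column names Modeling, Discourse and Time are hypothetical stand-ins for the real columns; the recoding to -1 is optional.

```r
# Hypothetical toy data: the pairs (S, A), (C, MA) and (S, P) never occur
toy <- data.frame(
  Modeling  = factor(c("A", "A", "MA", "P")),
  Discourse = factor(c("C", "C", "S", "C")),
  Time      = c(1.5, 3.2, 2.0, 4.1)
)

# Maximum time per (Discourse, Modeling) pair; empty cells come back as NA
m <- with(toy, tapply(Time, list(Discourse, Modeling), max))
m

# Optional: recode the non-occurring pairs to -1 instead of NA
m_filled <- m
m_filled[is.na(m_filled)] <- -1
m_filled
```

With the real data, the same call would use the backticked column names `Discourse Code` and `Modeling Code` and the Time_Processed column.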

Stacked bar graph with fill ggplot2

I've read through the ggplot2 docs website and other questions but I couldn't find a solution. I'm trying to visualize some data for varying age groups. I have sort of managed to do it, but it does not look the way I intend.
Here is the code for my plot
p <- ggplot(suggestion, aes(interaction(Age,variable), value, color = Age, fill = factor(variable), group = Age))
p + geom_bar(stat = "identity") +
facet_grid(.~Age)
[Figure: The faceting separates the age variables]
My ultimate goal is to create a stacked bar graph, which is why I used the fill, but it does not put the TDX values in their corresponding Age group and Year. (Sometimes TDX values == DX values, but I want to visualize when they don't.)
Here's the dput(suggestion)
structure(list(Age = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L,
7L), .Label = c("0-2", "3-9", "10-19", "20-39", "40-59", "60-64",
"65+", "UNSP", "(all)"), class = "factor"), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("Year.10.DX", "Year.11.DX",
"Year.12.DX", "Year.13.DX", "Year.10.TDX", "Year.11.TDX", "Year.12.TDX",
"Year.13.TDX"), class = "factor"), value = c(26.8648932910636,
30.487741796656, 31.9938838749782, 62.8189679326958, 72.8480838120064,
69.3044125928752, 36.9789457527416, 21.808001825378, 24.1073451428435,
40.3305134762935, 70.4486116545885, 68.8342676191755, 63.9227718107745,
34.6086468618636, 8.84033719571875, 13.2807072303835, 28.4781516422802,
55.139497471546, 59.7230544500003, 67.9448927372699, 37.7293286937066,
6.9507024051526, 17.4393054963572, 33.1485743479821, 61.198647580693,
58.6845873573852, 48.0073013177248, 28.4455801248562, 26.8648932910636,
19.8044453272475, 23.0189084635948, 53.7037832071889, 60.6516550126422,
58.1573725886767, 27.0791868812255, 21.808001825378, 19.8146296425633,
35.0587750051557, 62.3308555053346, 59.3299998610862, 56.5341245769817,
27.7229319271878, 8.84033719571875, 13.2807072303835, 22.4081606349585,
48.0252683906252, 52.7560684009579, 65.2890977685045, 32.4142337849399,
6.9507024051526, 15.2833655677215, 24.5268503180754, 52.536784326675,
51.4100599515986, 40.9609231655724, 18.1306673637441)), row.names = c(NA,
-56L), .Names = c("Age", "variable", "value"), class = "data.frame")
It's unclear exactly what you need, but perhaps this (here a is your suggestion data frame):
ggplot(a,aes(x=variable,y=value,fill=Age)) + geom_bar(stat='identity') +
facet_wrap(~Age)
If you want to visualize separately the TDX and the DX entries, we'll need to change the dataframe a bit.
> head(a)
Age variable value
1 0-2 Year.10.DX 26.86489
2 3-9 Year.10.DX 30.48774
3 10-19 Year.10.DX 31.99388
4 20-39 Year.10.DX 62.81897
5 40-59 Year.10.DX 72.84808
6 60-64 Year.10.DX 69.30441
The column of interest variable is a combination of year and of TDX/DX value. We'll use the tidyr package to separate this into two columns.
library(tidyr)
library(dplyr)
tidy_a<- a %>% separate(variable, into = c( 'nothing',"year",'label'), sep = "\\.")
This actually splits the levels of column variable into three components, since we split on . and the character . appears twice in each entry.
> head(tidy_a)
Age nothing year label value
1 0-2 Year 10 DX 26.86489
2 3-9 Year 10 DX 30.48774
3 10-19 Year 10 DX 31.99388
4 20-39 Year 10 DX 62.81897
5 40-59 Year 10 DX 72.84808
6 60-64 Year 10 DX 69.30441
So the column nothing is rather useless; it is just a necessary by-product of using separate and splitting on the "." character. The split now allows us to visualize TDX and DX separately.
ggplot(tidy_a,aes(x=year,y=value,fill=label)) + geom_bar(stat='identity') + facet_wrap(~Age)
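For readers without tidyr, the splitting step can be sketched in base R with strsplit(). This is an assumed equivalent for illustration, shown on a few example labels rather than the full data; the names parts and split_df are hypothetical.

```r
# A few example labels in the same "Year.<yy>.<DX|TDX>" form as the data
variable <- c("Year.10.DX", "Year.11.TDX", "Year.13.DX")

# Split on the literal "." (fixed = TRUE avoids regex escaping)
parts <- do.call(rbind, strsplit(variable, ".", fixed = TRUE))

# Keep the second and third pieces; the first is always "Year"
split_df <- data.frame(year = parts[, 2], label = parts[, 3],
                       stringsAsFactors = FALSE)
split_df
```

The resulting year and label columns play the same role as those produced by separate() above.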

Create dataframe from logical test

I've got what I know is a really easy question, but I'm stumped and seem to lack the vocabulary to seek out the answer effectively with the search bar.
I have a data frame full of numbers similar to this (though not of the same class)
Dat <- structure(c(9L, 9L, 3L, 3L, 2L, 9L, 10L, 5L, 6L, 2L, 4L, 6L,
10L, 2L, 9L, 0L, 1L, 8L, 9L, 7L, 7L, 4L, 4L, 3L, 4L, 7L, 7L,
1L, 0L, 3L, 6L, 10L, 8L, 3L, 0L, 7L, 7L, 1L, 2L, 8L, 5L, 7L,
7L, 8L, 2L, 1L, 10L, 3L, 0L, 2L, 7L, 0L, 0L, 7L, 9L, 8L, 9L,
0L, 4L, 4L, 5L, 6L, 6L, 2L, 4L, 1L, 6L, 2L, 4L, 7L, 5L, 2L, 7L,
4L, 8L, 3L, 3L, 2L, 5L, 1L, 1L, 3L, 8L, 0L, 1L, 8L, 8L, 1L, 1L,
0L, 4L, 4L, 4L, 5L, 6L, 9L, 5L, 2L, 6L, 3L), .Dim = c(10L, 10L
))
All I want to do is replace all values > 5 with a 1, and all values less than 5 with a 0. I've gotten as far as getting a frame with TRUE and FALSE, but can't seem to figure out how to replace things.
Datlog <- Dat > 5
Any help would be greatly appreciated. Thank you.
If I read your question correctly, you'll kick yourself for the answer:
(Dat > 5) * 1
TRUE and FALSE in R equate to 1 and 0 respectively. As such, the more semantically correct way to do this would be something like:
out <- as.numeric(Dat > 5)
dim(out) <- dim(Dat)
The two step approach is required in this second approach because when you use as.numeric, the dims of the original data are lost.
One way to replace with different values would be to use factor:
out <- factor((Dat > 5), c(TRUE, FALSE), c("YES", "NO"))
dim(out) <- dim(Dat)
Another way would be basic subsetting and substitution:
out <- Dat
out[out > 5] <- 999
out[out <= 5] <- 0
out
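As a compact comparison, the sketch below applies three equivalent recodings to a small made-up matrix and confirms that each keeps the original dimensions. The matrix m and the result names are assumptions for illustration. Note that a value of exactly 5 maps to 0 here, because the test is strictly greater than 5.

```r
# Small made-up matrix standing in for Dat
m <- matrix(c(9L, 3L, 5L, 10L, 0L, 6L), nrow = 2)

# 1) Arithmetic on the logical matrix keeps the dim attribute
by_mult <- (m > 5) * 1

# 2) ifelse() returns a result shaped like its test argument
by_ifelse <- ifelse(m > 5, 1, 0)

# 3) Change the storage mode of the logical matrix in place
by_mode <- m > 5
storage.mode(by_mode) <- "integer"

by_mult
```

All three give the same 0/1 matrix; the third stores integers rather than doubles.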
