Creating boxplot based on some conditions - r

Data given are a sample of cholesterol levels taken from 24 hospital employees who were on a standard American diet and who agreed to adopt a vegetarian diet for 1 month. Serum-cholesterol measurements were made before adopting the diet and 1 month after.
Subject Before After Difference
1 1 195 146 49
2 2 145 155 -10
3 3 205 178 27
4 4 159 146 13
5 5 244 208 36
6 6 166 147 19
7 7 250 202 48
8 8 236 215 21
9 9 192 184 8
10 10 224 208 16
11 11 238 206 32
12 12 197 169 28
13 13 169 182 -13
14 14 158 127 31
15 15 151 149 2
16 16 197 178 19
17 17 180 161 19
18 18 222 187 35
19 19 168 176 -8
20 20 168 145 23
21 21 167 154 13
22 22 161 153 8
23 23 178 137 41
24 24 137 125 12
Now here is the question I am trying to answer. Some investigators believe that the effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. If you split the data according to whether baseline cholesterol is above or below the median, can you comment descriptively on this issue?
I am thinking of creating a boxplot based on two categories, using dplyr for the data manipulation. I will create a new column based on whether Before is less than or greater than the median of Before. This gives a new character vector with "high" for high Before cholesterol and "low" for low Before cholesterol. Then I will draw a boxplot of Difference grouped by the new categorical column. Here is my code. The original data set is called df2.
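library(dplyr)
library(ggplot2)
# label each subject "low" or "high" relative to the median baseline value;
# group_by() is not needed here because ggplot groups by the x aesthetic
df2 %>%
mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
ggplot(aes(x= new_col, y=Difference)) +
geom_boxplot()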
And the following is the boxplot I get.
Based on this, I conclude that the investigators are right and that the effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. I want to know if this can be done more effectively.

This is more a question about the statistical approach than a programming question, so it belongs on stats.stackexchange rather than Stack Overflow.
Anyway, categorizing a variable depending on the median is not the recommended way of visualizing associations, as you are suppressing a lot of information. You can read about this in this very good article by Peter Flom.
It is better to keep all the points and apply some spline or smoothing algorithm.
For instance, you could consider something like this:
ggplot(df2, aes(x= Before, y=Difference)) +
geom_point() +
geom_smooth()
Here, the relationship is clearly visible, and all of the information is kept.
If you really have to generate subgroups, you could also try something like this:
df2 %>%
mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
ggplot(aes(x= Before, y=Difference, group=new_col, color=new_col)) +
geom_point() +
geom_smooth(span=3) #try some other values here
However, splitting at the median is still not a very good idea, especially with so few data points. You might want to assess the functional form of the relationship, but that would need a specific question on stats.stackexchange.com.
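If a descriptive split is still required, the two groups can be summarised numerically alongside any plot. The following is a minimal base R sketch, assuming df2 holds the Before and Difference columns shown in the question; the column name baseline is only an illustrative choice.

```r
# Data from the question (Before and Difference columns only).
df2 <- data.frame(
  Before = c(195, 145, 205, 159, 244, 166, 250, 236, 192, 224, 238, 197,
             169, 158, 151, 197, 180, 222, 168, 168, 167, 161, 178, 137),
  Difference = c(49, -10, 27, 13, 36, 19, 48, 21, 8, 16, 32, 28,
                 -13, 31, 2, 19, 19, 35, -8, 23, 13, 8, 41, 12)
)

# Label each subject by whether baseline cholesterol is below the median.
df2$baseline <- ifelse(df2$Before < median(df2$Before), "low", "high")

# Group sizes, then mean, median and SD of the change within each group.
group_sizes <- table(df2$baseline)
group_summary <- aggregate(
  Difference ~ baseline, data = df2,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)
print(group_sizes)
print(group_summary)
```

Comparing the spread as well as the centre of each group helps to judge whether the apparent gap is large relative to the variation within groups.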

Not really an answer, but a different approach to visualising the data.
library( data.table )
library( ggplot2 )
DT.melt <- melt( DT, id.vars = "Subject", measure.vars = c( "Before", "After" ) )
ggplot() +
geom_line( data = DT.melt,
aes( x = variable, y = value, group = Subject ) ) +
geom_line( data = DT.melt[, .(mean = mean(value)), by = variable ],
aes( x = variable, y = mean, group = 1 ), color = "red", size = 2 ) +
labs( x = "", y = "" )
Sample data used (define DT before running the code above):
DT <- fread(" Subject Before After Difference
1 195 146 49
2 145 155 -10
3 205 178 27
4 159 146 13
5 244 208 36
6 166 147 19
7 250 202 48
8 236 215 21
9 192 184 8
10 224 208 16
11 238 206 32
12 197 169 28
13 169 182 -13
14 158 127 31
15 151 149 2
16 197 178 19
17 180 161 19
18 222 187 35
19 168 176 -8
20 168 145 23
21 167 154 13
22 161 153 8
23 178 137 41
24 137 125 12")

Related

Subscript out of bounds in R

I am using code based on DESeq2. One of my goals is to plot a heatmap of the data.
heatmap.data <- counts(dds)[topGenes,]
The error I am getting is
Error in counts(dds)[topGenes, ]: subscript out of bounds
The first few lines of my counts(dds) output look like this.
99h1 99h2 99h3 99h4 wth1 wth2
ENSDARG00000000002 243 196 187 117 91 96
ENSDARG00000000018 42 55 53 32 48 48
ENSDARG00000000019 91 91 108 64 95 94
ENSDARG00000000068 3 10 10 10 30 21
ENSDARG00000000069 55 47 43 53 51 30
ENSDARG00000000086 46 26 36 18 37 29
ENSDARG00000000103 301 289 289 199 347 386
ENSDARG00000000151 18 19 17 14 22 19
ENSDARG00000000161 16 17 9 19 10 20
ENSDARG00000000175 10 9 10 6 16 12
ENSDARG00000000183 12 8 15 11 8 9
ENSDARG00000000189 16 17 13 10 13 21
ENSDARG00000000212 227 208 259 234 78 69
ENSDARG00000000229 68 72 95 44 71 64
ENSDARG00000000241 71 92 67 76 88 74
ENSDARG00000000324 11 9 6 2 8 9
ENSDARG00000000370 12 5 7 8 0 5
ENSDARG00000000394 390 356 339 283 313 286
ENSDARG00000000423 0 0 2 2 7 1
ENSDARG00000000442 1 1 0 0 1 1
ENSDARG00000000472 16 8 3 5 7 8
ENSDARG00000000476 2 1 2 4 6 3
ENSDARG00000000489 221 203 169 144 84 114
ENSDARG00000000503 133 118 139 89 91 112
ENSDARG00000000529 31 25 17 26 15 24
ENSDARG00000000540 25 17 17 10 28 19
ENSDARG00000000542 15 9 9 6 15 12
How do I ensure all the elements of the top genes are present in it?
When I print the top 20 genes in the dataset, I get a list of genes like this, with the column header V1:
[1] "6339" "12416" "1241" "3025" "12791" "846" "15090"
[8] "6529" "14564" "4863" "12777" "1122" "7454" "13716"
[15] "5790" "3328" "1231" "13734" "2797" "9072"
I have used both
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = TRUE)
and
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = FALSE)
to see if that removes the out-of-bounds error. However, neither helped. I guess the V1 header is causing the issue.
The topGenes vector was generated using the code snippet below.
resordered <- res[order(res$padj),]
#Reorder gene list by increasing pAdj
resordered <- as.data.frame(res[order(res$padj),])
#Filter for genes that are differentially expressed with an FDR < 0.01
ii <- which(res$padj < 0.01)
length(ii)
# Use the rownames() function to get the top 20 differentially expressed genes from our results table
topGenes <- rownames(resordered[1:20,])
topGenes
# Get the counts from the DESeqDataSet using the counts() function
heatmap.data <- counts(dds)[topGenes,]
Perhaps this will do what you want?
counts_dds <- counts(dds)
topgenes <- c("ENSDARG00000000002", "ENSDARG00000000489", "ENSDARG00000000503",
"ENSDARG00000000540", "ENSDARG00000000529", "ENSDARG00000000542")
heatmap.data <- counts_dds[rownames(counts_dds) %in% topgenes,]
If you provide more information it will be easier to advise you on how to fix your problem.
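One way to diagnose the error is to check which entries of topGenes actually occur among the row names before subsetting. This is a minimal sketch that uses a small made-up matrix as a stand-in for counts(dds); the names counts_mat, found and missing are illustrative only.

```r
# Stand-in for counts(dds): a tiny matrix with gene IDs as row names
# (values copied from the first rows shown in the question).
counts_mat <- matrix(
  c(243, 196, 42, 55, 91, 91),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSDARG00000000002", "ENSDARG00000000018", "ENSDARG00000000019"),
    c("99h1", "99h2")
  )
)

# A mix of a valid gene ID and a row-number-like string, as in the question.
topGenes <- c("ENSDARG00000000002", "6339")

# Character subscripts must match row names exactly; find those that do not.
found   <- intersect(topGenes, rownames(counts_mat))
missing <- setdiff(topGenes, rownames(counts_mat))
print(missing)

# Subsetting with only the matching IDs avoids "subscript out of bounds".
heatmap.data <- counts_mat[found, , drop = FALSE]
print(heatmap.data)
```

If topGenes was read with read.table(), it is a data frame rather than a character vector, so the column (for example topGenes$V1) would need to be extracted first. The values printed in the question look like row numbers rather than ENSDARG identifiers, which may be why none of them match.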

Error in VAR model

I have this data:
Year W L PTS GF GA S SA
1 2006 49 25 106 253 224 2380 2662
2 2007 51 23 110 266 207 2261 2553
3 2008 41 32 91 227 224 2425 2433
4 2009 40 34 88 207 228 2375 2398
5 2010 47 29 100 217 221 2508 2389
6 2011 44 27 99 213 190 2362 2506
7 2012 48 26 104 232 205 2261 2517
8 2014 38 32 88 214 233 2382 2365
9 2015 47 25 104 226 202 2614 2304
10 2016 41 27 96 224 213 2507 2231
11 2017 41 29 94 238 220 2557 2458
12 2018 53 18 117 261 204 2641 2650
I've built a VAR model from this data (it's hockey data for one team for the listed years). I converted the above into a time series with the ts() function, and created this model:
VARselect(NSH_ts[, 3:5], lag.max = 8)
var1 <- VAR(NSH_ts[, 3:5], p = 2, type = "both", ic = c("AIC"))
serial.test(var1, type = "PT.adjusted")
forecast.var1 <- forecast(var1, h = 2)
autoplot(forecast.var1) +
scale_x_continuous(breaks = seq(2006, 2022))
I want to use the serial.test() function, but I get this error:
Error in t(Ci) %*% C0inv : non-conformable arguments
Why won't the serial.test() function work? (Overall, I'm trying to forecast PTS for the next two years, based on the variables in the set.)
I've been using this as a guide: https://otexts.org/fpp2/VAR.html
I'm getting a different error, which may be from the VARselect. My table is mostly -Inf entries, with one NaN, and the rest 0. Adjusting the lag.max gave me real numbers, and I had to adjust the other values as well.
VARselect(dfVAR[, 3:5], lag.max = 2)
var1 <- VAR(dfVAR[, 3:5], p = 1, type = "both", ic = c("AIC"))
serial.test(var1, lags.pt = 4, type = "PT.adjusted")
Portmanteau Test (adjusted)
data: Residuals of VAR object var1
Chi-squared = 35.117, df = 27, p-value = 0.1359
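The -Inf entries are plausibly a symptom of asking for far more coefficients than the 12 observations can support. The rough count below is a sketch under two assumptions that are not stated in the thread: that each equation of a VAR with K variables, p lags and type = "both" has K * p + 2 regressors, and that the selection criteria are all computed on the rows left after dropping lag.max observations.

```r
n_obs   <- 12   # seasons in the data
K       <- 3    # series used: NSH_ts[, 3:5]
lag_max <- 8    # value passed to VARselect() in the question

# Rows left once the first lag_max observations are used up as lags.
usable_rows <- n_obs - lag_max

# Regressors per equation for each candidate lag order p:
# K lagged values per lag, plus a constant and a trend (type = "both").
p <- seq_len(lag_max)
regressors <- K * p + 2

fit_check <- data.frame(p, regressors, usable_rows,
                        overfit = regressors >= usable_rows)
print(fit_check)
```

Under the same assumptions, lag.max = 2 leaves 10 rows, and p = 1 needs only 5 regressors per equation, which is consistent with the finite values reported above.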
The non-conformable error means the matrix algebra cannot be carried out: the number of columns in the first matrix has to match the number of rows in the second. Having no knowledge of VAR models, I can't offer help beyond this.
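The dimension rule can be reproduced with two small matrices. This is only an illustration of the error message, not of what happens inside serial.test(); the matrices A, B and C are made up.

```r
A <- matrix(1:6, nrow = 2)   # 2 x 3
B <- matrix(1:6, nrow = 2)   # 2 x 3

# ncol(A) is 3 but nrow(B) is 2, so A %*% B is not defined.
msg <- tryCatch(
  { A %*% B; "no error" },
  error = function(e) conditionMessage(e)
)
print(msg)

# Transposing B makes the inner dimensions agree: (2 x 3) %*% (3 x 2).
C <- A %*% t(B)
print(dim(C))
```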

Plot histogram by first sorting data and then dividing x values into bins in R

I have a dataset in a given format:
USER.ID avgfrequency
1 3 3.7821782
2 7 14.7500000
3 9 13.4761905
4 13 5.1967213
5 16 6.7812500
6 26 41.7500000
7 49 13.6666667
8 50 7.0000000
9 51 1.0000000
10 52 17.7500000
11 69 4.5000000
12 75 9.9500000
13 91 84.2000000
14 98 8.0185185
15 138 14.2000000
16 139 34.7500000
17 149 7.6666667
18 155 35.3333333
19 167 24.0000000
20 170 7.3529412
21 171 4.4210526
22 175 6.5781250
23 176 19.2857143
24 177 10.4864865
25 178 28.0000000
26 180 4.8461538
27 183 25.5000000
28 184 13.0000000
29 210 32.0000000
30 215 13.4615385
31 220 11.3611111
32 223 26.2500000
I want to first sort the dataset by avgfrequency and then I want to plot count of USER.ID's that fall under different bin categories.
I want to divide avgfrequency into different bin categories of width 10.
I am trying to sort data using:
user_avgfrequency <- user_avgfrequency[order(user_avgfrequency[,1]), ]
but getting an error.
df <- data.frame(USER.ID=c(3,7,9,13,16,26,49,50,51,52,69,75,91,98,138,139,149,155,167,170,171,175,176,177,178,180,183,184,210,215,220,223), avgfrequency=c(3.7821782,14.7500000,13.4761905,5.1967213,6.7812500,41.7500000,13.6666667,7.0000000,1.0000000,17.7500000,4.5000000,9.9500000,84.2000000,8.0185185,14.2000000,34.7500000,7.6666667,35.3333333,24.0000000,7.3529412,4.4210526,6.5781250,19.2857143,10.4864865,28.0000000,4.8461538,25.5000000,13.0000000,32.0000000,13.4615385,11.3611111,26.2500000) );
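# bins of width 10 covering the full range of avgfrequency
breaks <- seq(0, ceiling(max(df$avgfrequency) / 10) * 10, 10)
# one colour per bin
cols <- colorRampPalette(c('blue', 'green', 'red'))(length(breaks) - 1)
hist(df$avgfrequency, breaks, col = cols, axes = FALSE, xlab = 'Average Frequency', ylab = 'Count')
# draw the axes by hand so that the ticks fall on the bin edges and on whole counts
axis(1, breaks)
axis(2, 0:max(tabulate(cut(df$avgfrequency, breaks))))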
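The sorting and counting steps can also be done explicitly, which makes the per-bin counts available as a table. A minimal base R sketch using only the first eight rows of the data; the names sorted and bin_counts are illustrative. Note that the sort uses the avgfrequency column by name, whereas the attempt in the question orders by the first column (USER.ID).

```r
df <- data.frame(
  USER.ID = c(3, 7, 9, 13, 16, 26, 49, 50),
  avgfrequency = c(3.7821782, 14.75, 13.4761905, 5.1967213,
                   6.78125, 41.75, 13.6666667, 7)
)

# Sort by the avgfrequency column (by name, not by position).
sorted <- df[order(df$avgfrequency), ]

# Bins of width 10 that cover the full range of the data.
breaks <- seq(0, ceiling(max(sorted$avgfrequency) / 10) * 10, by = 10)

# Assign each user to a bin and count users per bin.
sorted$bin <- cut(sorted$avgfrequency, breaks = breaks)
bin_counts <- table(sorted$bin)
print(bin_counts)

# The same counts as a bar chart.
barplot(bin_counts, xlab = "Average Frequency", ylab = "Count")
```

Sorting is not required for hist() or cut(), since both bin the values regardless of their order; it only matters if the sorted table itself is wanted.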

ggplot with data frame columns

I am totally lost with ggplot. I've tried various solutions, but none were successful. Using the numbers below, I want to create a line graph with three lines, representing df$c, df$d, and df$e, with df$a on the x-axis and the cumulative probability on the y-axis, where 95=100%.
a b c d e
1 0 18 0.047368421 0.036842105 0.005263158
2 1 20 0.047368421 0.036842105 0.010526316
13 2 26 0.052631579 0.031578947 0.026315789
20 3 35 0.084210526 0.036842105 0.031578947
22 4 41 0.068421053 0.052631579 0.047368421
24 5 88 0.131578947 0.068421053 0.131578947
26 7 90 0.131578947 0.068421053 0.136842105
27 8 93 0.126315789 0.068421053 0.147368421
28 9 96 0.126315789 0.073684211 0.152631579
3 10 115 0.105263158 0.078947368 0.210526316
4 11 116 0.105263158 0.084210526 0.210526316
5 12 120 0.094736842 0.084210526 0.226315789
6 13 128 0.105263158 0.073684211 0.247368421
7 14 129 0.100000000 0.073684211 0.252631579
8 15 154 0.031578947 0.042105263 0.368421053
9 16 155 0.031578947 0.036842105 0.373684211
10 17 158 0.036842105 0.036842105 0.378947368
11 18 161 0.036842105 0.031578947 0.389473684
12 19 163 0.026315789 0.031578947 0.400000000
14 20 169 0.026315789 0.021052632 0.421052632
15 21 171 0.015789474 0.021052632 0.431578947
16 22 174 0.010526316 0.021052632 0.442105263
17 24 176 0.010526316 0.021052632 0.447368421
18 25 186 0.005263158 0.005263158 0.484210526
19 26 187 0.005263158 0.000000000 0.489473684
21 35 188 0.005263158 0.005263158 0.489473684
23 40 189 0.005263158 0.000000000 0.494736842
25 60 190 0.000000000 0.000000000 0.500000000
I was somewhat successful using base R:
plot(df$a, df$c, type="l",col="red")
lines(df$a, df$d, col="green")
lines(df$a, df$e, col="blue")
You first need to melt your data so that you have one column that designates which variable each value comes from (call it variable) and another column that holds the actual value (call it value). Study the example below, where the data frame is called xy, to fully understand what happens to the variables from the original data.frame that you want to keep constant.
library(reshape2)
xymelt <- melt(xy, id.vars = "a")
library(ggplot2)
ggplot(xymelt, aes(x = a, y = value, color = variable)) +
theme_bw() +
geom_line()
ggplot(xymelt, aes(x = a, y = value)) +
theme_bw() +
geom_line() +
facet_wrap(~ variable)
This code also draws the column called "b" from your data, because every column other than "a" is melted. You can remove it prior to melting, after melting, prior to plotting... or plot it.
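Since the question asks only for c, d and e, one option is to restrict which columns are melted. The sketch below shows two ways of doing that on the first three rows of the data; xymelt_all and xymelt2 are illustrative names, and measure.vars is the reshape2 argument that selects the columns to melt.

```r
library(reshape2)

# Small stand-in for the data in the question (first three rows).
xy <- data.frame(
  a = c(0, 1, 2),
  b = c(18, 20, 26),
  c = c(0.047368421, 0.047368421, 0.052631579),
  d = c(0.036842105, 0.036842105, 0.031578947),
  e = c(0.005263158, 0.010526316, 0.026315789)
)

# Option 1: melt only the columns to be plotted.
xymelt <- melt(xy, id.vars = "a", measure.vars = c("c", "d", "e"))

# Option 2: melt everything, then drop the unwanted rows.
xymelt_all <- melt(xy, id.vars = "a")
xymelt2 <- droplevels(subset(xymelt_all, variable != "b"))

print(levels(xymelt$variable))
```

Either result can be passed to the ggplot() calls above in place of the original xymelt.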

plotting multiple variables in ggplot

I have a data table which looks like this-
pos gtt1 gtt2 ftp1 ftp2
8 100 123 49 101
9 85 93 99 110
10 111 102 53 113
11 88 110 59 125
12 120 118 61 133
13 90 136 64 145
14 130 140 104 158
15 78 147 74 167
16 123 161 81 173
17 160 173 88 180
18 117 180 94 191
19 89 188 104 199
20 175 197 107 213
I want to make a line graph with pos (position) on the x-axis using ggplot. I am trying to show the gtt1 and gtt2 lines in one colour and ftp1 and ftp2 in another, because they are separate groups of samples (gtt and ftp). I have successfully created the graph, but all four lines are in different colours. I would like to keep only gtt and ftp in the legend (not all four). As a bonus, how can I make these lines a little smoother?
Here is what I did so far:
library(reshape2);library(ggplot2)
data <- read.table("myfile.txt",header=TRUE,sep="\t")
data.melt <- melt(data,id="pos")
ggplot(data.melt,aes(x=pos, y=value,colour=variable))+geom_line()
Thanks in advance
The easiest way is to re-shape your data in a slightly different way:
dd1 = melt(dd[,1:3], id=c("pos"))
dd1$type = "gtt"
dd2 = melt(dd[,c(1, 4:5)], id=c("pos"))
dd2$type = "ftp"
dd.melt = rbind(dd1, dd2)
Now we have a column specifying the variable "type":
R> head(dd.melt, 2)
pos variable value type
1 8 gtt1 100 gtt
2 9 gtt1 85 gtt
Once the data is in this format, the ggplot command is straightforward:
# store the plot as g so that further layers can be added to it
g = ggplot(dd.melt,aes(x=pos, y=value))+
geom_line(aes(colour=type, group=variable)) +
scale_colour_manual(values=c(gtt="blue", ftp="red"))
g
You can add smoothed lines using stat_smooth:
##span controls the smoothing
g + stat_smooth(se=FALSE, span=0.5)
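As an alternative sketch, the type column can be derived from the variable names instead of melting twice, and the colour and group mappings can be placed in the top-level aes() so that stat_smooth() inherits them and smooths each of the four series separately. The regular expression, the span value and the name p are illustrative choices, and dd is assumed to hold the data from the question.

```r
library(reshape2)
library(ggplot2)

# Data from the question.
dd <- data.frame(
  pos  = 8:20,
  gtt1 = c(100, 85, 111, 88, 120, 90, 130, 78, 123, 160, 117, 89, 175),
  gtt2 = c(123, 93, 102, 110, 118, 136, 140, 147, 161, 173, 180, 188, 197),
  ftp1 = c(49, 99, 53, 59, 61, 64, 104, 74, 81, 88, 94, 104, 107),
  ftp2 = c(101, 110, 113, 125, 133, 145, 158, 167, 173, 180, 191, 199, 213)
)

dd.melt <- melt(dd, id.vars = "pos")

# Derive the group by stripping the trailing digits: "gtt1" -> "gtt".
dd.melt$type <- sub("[0-9]+$", "", as.character(dd.melt$variable))

# Colour and group in the top-level aes() are inherited by stat_smooth(),
# so each series gets its own smoothed line, coloured by type.
p <- ggplot(dd.melt, aes(x = pos, y = value, colour = type, group = variable)) +
  stat_smooth(method = "loess", formula = y ~ x, se = FALSE, span = 0.75) +
  scale_colour_manual(values = c(gtt = "blue", ftp = "red"))
print(p)
```

Because colour is mapped to type, the legend shows only gtt and ftp, while group = variable keeps the four series as separate lines.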
