I have a dataset where I have three different groups of individuals, let´s call them Green, Red, and Blue. Then I have data covering 92 proteins in their blood, from which I have readings for each individual in each group.
I would like to get a good overview of the variances and means for each protein for each group. Which means that I would like to make a multiple box plot graph.
I would like to have the different proteins on the x-axis, and three box plots (preferably in different colors) (one for each group) above every protein, with numeric protein weight on the y-axis.
How do I do this?
I am currently working with a data frame where the groups are divided by the rows, and the different protein readings is in each column.
Tried to add a picture, but apparently you need reputation-points…
I´ve heard that you can use the melt command in reshape2, but I need guidance in how to use it.
Please, simplify the answers. I´m not very experienced when it comes to R.
Look, I realize things are frustrating when you are first getting started, but you're going to have to ask specific and targeted questions for people to be willing and able to help you out in a structured way.
Having said that, let's walk through a structured example. I am only going to use 9 proteins here, but you should get the idea.
library(ggplot2)
library(reshape2)
# Setup a data frame, since the question did not provide one...
df <- structure(list(Individual = 1:12,
Group = structure(c(2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L),
.Label = c("Blue", "Green", "Red"), class = "factor"),
Protein_1 = c(82L, 23L, 19L, 100L, 33L, 86L, 32L, 41L, 39L, 59L, 93L, 99L),
Protein_2 = c(86L, 50L, 86L, 90L, 37L, 20L, 26L, 38L, 87L, 81L, 23L, 49L),
Protein_3 = c(81L, 31L, 5L, 10L, 79L, 40L, 27L, 73L, 64L, 30L, 87L, 64L),
Protein_4 = c(52L, 15L, 25L, 12L, 63L, 52L, 60L, 33L, 27L, 32L, 53L, 93L),
Protein_5 = c(19L, 75L, 25L, 14L, 33L, 60L, 73L, 13L, 92L, 92L, 91L, 12L),
Protein_6 = c(33L, 49L, 29L, 58L, 51L, 12L, 61L, 48L, 71L, 18L, 84L, 31L),
Protein_7 = c(84L, 57L, 28L, 99L, 47L, 54L, 72L, 97L, 73L, 46L, 68L, 37L),
Protein_8 = c(15L, 16L, 46L, 95L, 57L, 86L, 30L, 83L, 45L, 12L, 49L, 82L),
Protein_9 = c(84L, 91L, 33L, 10L, 91L, 91L, 4L, 88L, 42L, 82L, 76L, 95L)),
.Names = c("Individual", "Group", "Protein_1", "Protein_2", "Protein_3",
"Protein_4", "Protein_5", "Protein_6", "Protein_7", "Protein_8", "Protein_9"),
class = "data.frame", row.names = c(NA, -12L))
head(df)
# Individual Group Protein_1 Protein_2 Protein_3 Protein_4 Protein_5 Protein_6 Protein_7 Protein_8 Protein_9
# 1 1 Green 82 86 81 52 19 33 84 15 84
# 2 2 Blue 23 50 31 15 75 49 57 16 91
# 3 3 Red 19 86 5 25 25 29 28 46 33
# 4 4 Green 100 90 10 12 14 58 99 95 10
# 5 5 Blue 33 37 79 63 33 51 47 57 91
# 6 6 Red 86 20 40 52 60 12 54 86 91
?melt
df.melted <- melt(df, id.vars = c("Individual", "Group"))
head(df.melted)
# Individual Group variable value
# 1 1 Green Protein_1 82
# 2 2 Blue Protein_1 23
# 3 3 Red Protein_1 19
# 4 4 Green Protein_1 100
# 5 5 Blue Protein_1 33
# 6 6 Red Protein_1 86
# First Protein
# Notice I am using subset()
ggplot(data = subset(df.melted, variable == "Protein_1"),
aes(x = Group, y = value)) + geom_boxplot(aes(fill = Group))
# Second Protein
ggplot(data = subset(df.melted, variable == "Protein_2"),
aes(x = Group, y = value)) + geom_boxplot(aes(fill = Group))
# and so on...
# You could also use facets
ggplot(data = df.melted, aes(x = Group, y = value)) +
geom_boxplot(aes(fill = Group)) +
facet_wrap(~ variable)
And yes, I realize that the color groupings do not align with the colors of the plot...I will leave that as an exercise... You have to be willing to tinker, explore, and fail many times.
Related
ind
set
inst_0
inst_1
inst_2
Inst_3
inst_4
inst_5
0
1
20
30
50
55
58
60
0
2
34
44
46
67
89
70
0
3
37
89
78
80
90
98
0
4
23
45
67
89
87
89
1
1
34
56
65
78
77
89
1
2
23
32
45
55
66
77
1
3
35
69
88
99
98
57
1
4
23
45
56
78
89
99
2
1
23
34
55
55
77
88
2
2
12
44
55
67
88
90
2
3
12
66
77
91
44
99
2
4
45
55
88
31
56
100
I have a data frame like this above and I would like to make a plot showing this kind of a trend like in the graph below( this is only made for 4 individual in a same set) for the combinations of for example Ind0-set1, Ind1-set1, Ind2-set2...,Ind0-set2,Ind1-set2 and second question is that how to plot multiple line graph separately for each set in one graph?
I am not sure to use ggplot2 or it can be done plot function too.
If you want to do this using ggplot2 then the first step would be to reshape your data to long or tidy format using e.g. tidyr::pivot_longer:
library(tidyr)
library(dplyr)
library(ggplot2)
# Reshape to long
dat <- dat %>%
# Convert all column names to lower case
rename_with(tolower) %>%
pivot_longer(-c(ind, set), names_to = "inst", values_to = "value", names_prefix = "inst_")
After doing so you could create a plot showing all individuals for all sets by using facetting:
ggplot(dat, aes(inst, value, color = factor(ind), group = ind)) +
geom_line() +
geom_point() +
facet_wrap(~set)
Or you could filter your data for your desired combinations to create a plot for e.g. just one set like so:
dat_filtered <- dat[dat$set == 1, ]
ggplot(dat_filtered, aes(inst, value, color = factor(ind), group = ind)) +
geom_line() +
geom_point()
DATA
dat <- data.frame(
ind = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
set = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
inst_0 = c(20L, 34L, 37L, 23L, 34L, 23L, 35L, 23L, 23L, 12L, 12L, 45L),
inst_1 = c(30L, 44L, 89L, 45L, 56L, 32L, 69L, 45L, 34L, 44L, 66L, 55L),
inst_2 = c(50L, 46L, 78L, 67L, 65L, 45L, 88L, 56L, 55L, 55L, 77L, 88L),
Inst_3 = c(55L, 67L, 80L, 89L, 78L, 55L, 99L, 78L, 55L, 67L, 91L, 31L),
inst_4 = c(58L, 89L, 90L, 87L, 77L, 66L, 98L, 89L, 77L, 88L, 44L, 56L),
inst_5 = c(60L, 70L, 98L, 89L, 89L, 77L, 57L, 99L, 88L, 90L, 99L, 100L)
)
I want to convert values 0<x<9 & x<50.2 to NA.My data frame has the first five columns which do not have to have replaced, I only want to replace values in column 6 to column 60. I have tried to it in 2 steps as follows but, it also replaced values I dont intend to change
BdsDf[BdsDf > 50.2][6:60] <- NA; BdsDf[BdsDf < 9][6:60] <- NA
Here is one way:
# test data
df = data.frame(lapply(1:60, function(x) {
rnorm(100, 50, 25)
}))
df[,6:60][0 < df[,6:60] & df[,6:60] < 9 & df[,6:60] < 50.2] = NA
With naniar, you can use replace_with_na_at to select the specific columns and add the 2 conditions.
library(dplyr)
library(naniar)
BdsDf %>%
replace_with_na_at(.vars = -1:5,
condition = ~ .x < 9 | .x > 50.2)
Or in base R, you can do the following with the sample dataset. For your data, you would just change 6:8 to 6:60 (like with the commented out portion below).
BdsDf[,6:8][BdsDf[6:8] > 50.2 | BdsDf[6:8] < 9] <- NA
# BdsDf[,6:60][BdsDf[6:60] > 50.2 | BdsDf[6:60] < 9] <- NA
Output
X1 X2 X3 X4 X5 X6 X7 X8
1 75 86 91 16 6 NA NA NA
2 84 68 8 85 19 NA 38 NA
3 18 52 5 17 59 NA 28 NA
4 97 86 45 17 31 NA NA 28
5 41 95 60 80 49 NA 24 NA
6 47 56 65 35 44 18 NA 19
7 29 46 3 22 36 15 NA 10
8 48 50 60 38 47 NA NA 35
9 91 20 50 5 24 40 47 19
10 85 84 15 71 96 NA NA 26
Data
BdsDf <- structure(list(X1 = c(75L, 84L, 18L, 97L, 41L, 47L, 29L, 48L,
91L, 85L), X2 = c(86L, 68L, 52L, 86L, 95L, 56L, 46L, 50L, 20L,
84L), X3 = c(91L, 8L, 5L, 45L, 60L, 65L, 3L, 60L, 50L, 15L),
X4 = c(16L, 85L, 17L, 17L, 80L, 35L, 22L, 38L, 5L, 71L),
X5 = c(6L, 19L, 59L, 31L, 49L, 44L, 36L, 47L, 24L, 96L),
X6 = c(66L, 0L, 84L, 3L, 84L, 18L, 15L, 60L, 40L, 67L), X7 = c(73L,
38L, 28L, 6L, 24L, 91L, 79L, 7L, 47L, 88L), X8 = c(92L, 57L,
66L, 28L, 85L, 19L, 10L, 35L, 19L, 26L)), class = "data.frame", row.names = c(NA,
-10L))
I have a raw multidimensional contingency table that I want to convert to tidy data or other long form so that I can fit a logistic regression on it. I have found great methods for portions of it. But I'd like a strategy for dealing with whole thing iteratively.
Here's a half of it formatted:
White
<35 35-44 >44
Region M F M F M F
Northeast
Satisfied 288 60 224 35 337 70
Not satisfied 177 57 166 19 172 30
Mid-Atlantic
Satisfied 90 19 96 12 124 17
Not satisfied 45 12 42 5 39 2
Southern
Satisfied 226 88 189 44 156 70
Not satisfied 128 57 117 34 73 25
Here's the full raw data, stripped of its headers:
> dput(df_raw)
structure(list(V1 = c(288L, 177L, 90L, 45L, 226L, 128L), V2 = c(60L,
57L, 19L, 12L, 88L, 57L), V3 = c(224L, 166L, 96L, 42L, 189L,
117L), V4 = c(35L, 19L, 12L, 5L, 44L, 34L), V5 = c(337L, 172L,
124L, 39L, 156L, 73L), V6 = c(70L, 30L, 17L, 2L, 70L, 25L), V7 = c(38L,
33L, 18L, 6L, 45L, 31L), V8 = c(19L, 35L, 13L, 7L, 47L, 35L),
V9 = c(32L, 11L, 7L, 2L, 18L, 3L), V10 = c(22L, 20L, 0L,
3L, 13L, 7L), V11 = c(21L, 8L, 9L, 2L, 11L, 2L), V12 = c(15L,
10L, 1L, 1L, 9L, 2L)), class = "data.frame", row.names = c(NA,
-6L))
Here's how I can take care of one section:
ne35 <- data.frame(c(288, 177), c(60, 57))
colnames(ne35) <- c("Male", "Female")
rownames(ne35) <- c("Sat", "Unsat")
ne35 %>%
rownames_to_column() %>% # set row names as a variable
gather(rowname2,value,-rowname) %>% # reshape
rowwise() %>% # for every row
mutate(value = list(1:value)) %>% # numbers based on the value
unnest(value) %>% # unnest the counter
select(-value) # remove the counts
# A tibble: 582 x 2
rowname rowname2
<chr> <chr>
1 Sat Male
2 Sat Male
3 Sat Male
4 Sat Male
5 Sat Male
6 Sat Male
7 Sat Male
8 Sat Male
9 Sat Male
10 Sat Male
# … with 572 more rows
I am stumped, however, on how to apply this to a few tiers of categorical variables.
This question already has answers here:
Add (insert) a column between two columns in a data.frame
(18 answers)
Closed 4 years ago.
Suppose i have dataset
df=structure(list(SaleCount = c(7L, 35L, 340L, 260L, 3L, 31L, 420L,
380L, 45L, 135L, 852L, 1L, 34L, 360L, 140L, 14L, 62L, 501L, 560L,
0L, 640L, 0L, 0L, 16L, 0L), DocumentNum = c(36L, 4L, 41L, 41L,
36L, 4L, 41L, 41L, 33L, 33L, 33L, 36L, 4L, 41L, 41L, 33L, 33L,
33L, 62L, 63L, 62L, 63L, 36L, 4L, 41L)), .Names = c("SaleCount",
"DocumentNum"), class = "data.frame", row.names = c(NA, -25L))
i need create the column, but this column must be second by order.
If i do so:
df["MY_NEW_COLUMN"] <- NA .
The new colums is third.
How it create that it was second by order?
I.E. i expect output
SaleCount newcolumn DocumentNum
1 7 NA 36
2 35 NA 4
3 340 NA 41
4 260 NA 41
5 3 NA 36
6 31 NA 4
7 420 NA 41
8 380 NA 41
9 45 NA 33
10 135 NA 33
11 852 NA 33
12 1 NA 36
13 34 NA 4
14 360 NA 41
15 140 NA 41
16 14 NA 33
17 62 NA 33
18 501 NA 33
19 560 NA 62
20 0 NA 63
21 640 NA 62
22 0 NA 63
23 0 NA 36
24 16 NA 4
25 0 NA 41
Of course sometimes I need to create a fourth column by order and so on.
You can use the dplyr library and the select function.
library(dplyr)
df=structure(list(SaleCount = c(7L, 35L, 340L, 260L, 3L, 31L, 420L,
380L, 45L, 135L, 852L, 1L, 34L, 360L, 140L, 14L, 62L, 501L, 560L,
0L, 640L, 0L, 0L, 16L, 0L), DocumentNum = c(36L, 4L, 41L, 41L,
36L, 4L, 41L, 41L, 33L, 33L, 33L, 36L, 4L, 41L, 41L, 33L, 33L,
33L, 62L, 63L, 62L, 63L, 36L, 4L, 41L)), .Names = c("SaleCount",
"DocumentNum"), class = "data.frame", row.names = c(NA, -25L))
df["MY_NEW_COLUMN"] <- NA
select(df,SaleCount, MY_NEW_COLUMN, DocumentNum)
I am newbie in R and I am having troubles with summarizing data. I tried to follow tutorials in internet, but unfortunately I had errors all time.
I have a matrix where my response factor is "concentration"
In my experiment I have 3 treatments (a, b and c) and 5 replicates for each treatment. And I get the concentrations of 8 products (PRO1 - PRO8).
TRA PRO1 PRO2 PRO3 PRO4 PRO5 PRO6 PRO7 PRO8
1 a 83 85 59 46 64 8 76 74
2 a 61 71 73 15 87 95 61 9
3 a 78 12 35 23 56 95 67 11
4 a 48 30 75 94 57 15 58 58
5 a 51 92 30 60 22 9 64 5
6 b 46 17 66 79 30 99 3 38
7 b 40 25 11 18 66 25 55 38
8 b 34 94 83 63 30 100 56 31
9 b 3 81 26 73 32 56 4 12
10 b 18 40 13 51 4 44 75 4
11 c 68 28 20 15 13 56 5 82
12 c 50 85 65 85 13 13 34 69
13 c 75 37 11 55 58 69 85 67
14 c 71 30 83 46 87 67 59 70
15 c 10 76 50 20 98 81 57 76
I tried the summaryBy, doBy and these functions, however it did not work for me.
How can I sort my matrix in order to execute these functions and get the mean, sd? My intention is to make a barplot with error bars to see the differences between the treatments for each product.
Thanks
You may try
library(ggplot2)
library(dplyr)
library(tidyr)
gather(df1, Var, Val, -TRA) %>%
group_by(TRA, Var) %>%
summarise(Mean=mean(Val), SD=sd(Val)) %>%
ggplot(., aes(x=TRA, y=Mean, fill=Var))+
geom_bar(position=position_dodge(), stat='identity')+
geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD), width=.2,
position=position_dodge(.9))
data
df1 <- structure(list(TRA = c("a", "a", "a", "a", "a", "b", "b", "b",
"b", "b", "c", "c", "c", "c", "c"), PRO1 = c(83L, 61L, 78L, 48L,
51L, 46L, 40L, 34L, 3L, 18L, 68L, 50L, 75L, 71L, 10L), PRO2 = c(85L,
71L, 12L, 30L, 92L, 17L, 25L, 94L, 81L, 40L, 28L, 85L, 37L, 30L,
76L), PRO3 = c(59L, 73L, 35L, 75L, 30L, 66L, 11L, 83L, 26L, 13L,
20L, 65L, 11L, 83L, 50L), PRO4 = c(46L, 15L, 23L, 94L, 60L, 79L,
18L, 63L, 73L, 51L, 15L, 85L, 55L, 46L, 20L), PRO5 = c(64L, 87L,
56L, 57L, 22L, 30L, 66L, 30L, 32L, 4L, 13L, 13L, 58L, 87L, 98L
), PRO6 = c(8L, 95L, 95L, 15L, 9L, 99L, 25L, 100L, 56L, 44L,
56L, 13L, 69L, 67L, 81L), PRO7 = c(76L, 61L, 67L, 58L, 64L, 3L,
55L, 56L, 4L, 75L, 5L, 34L, 85L, 59L, 57L), PRO8 = c(74L, 9L,
11L, 58L, 5L, 38L, 38L, 31L, 12L, 4L, 82L, 69L, 67L, 70L, 76L
)), .Names = c("TRA", "PRO1", "PRO2", "PRO3", "PRO4", "PRO5",
"PRO6", "PRO7", "PRO8"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))