I have a table of the following type:
df0 <- read.table(text = 'Sample Method Mg Al Ca Ti
Sa A 5.5 2.2 33 0.2
Sb A 4.2 1.2 44 0.1
Sc A 1.1 0.5 25 0.3
Sd A 3.3 1.3 31 0.5
Se A 6.2 0.2 55 0.6
Sa B 5.2 2 35 0.25
Sb B 4.6 1.3 48 0.1
Sc B 1.6 0.8 22 0.32
Sd B 3.1 1.6 29 0.4
Se B 6.8 0.3 51 0.7
Sa C 5.6 2.5 30 0.2
Sb C 4.1 1.2 41 0.15
Sc C 1 0.6 22 0.4
Sd C 3.2 1.5 30 0.5
Se C 6.8 0.1 51 0.65', header = T, stringsAsFactors = F)
The table contains chemical compositions. I would like to use Method A as the reference (x-axis) and automatically produce scatter plots with the data from Methods B and C on the y-axis, each with a linear trend line, plus a 1:1 reference line that would correspond to a perfect match.
In other words, I would like to produce plots like this:
I think a solution could start from transforming the data frame into:
df <- read.table(text = 'Sample Mg_A Al_A Ca_A Ti_A Mg_B Al_B Ca_B Ti_B Mg_C Al_C Ca_C Ti_C
Sa 5.5 2.2 33 0.2 5.2 2 35 0.25 5.6 2.5 30 0.2
Sb 4.2 1.2 44 0.1 4.6 1.3 48 0.1 4.1 1.2 41 0.15
Sc 1.1 0.5 25 0.3 1.6 0.8 22 0.32 1 0.6 22 0.4
Sd 3.3 1.3 31 0.5 3.1 1.6 29 0.4 3.2 1.5 30 0.5
Se 6.2 0.2 55 0.6 6.8 0.3 51 0.7 6.8 0.1 51 0.65
', header = T, stringsAsFactors = F)
But I don't know how to go further.
Any help would be appreciated.
Best, Anne-Christine
You can use the following code
library(tidyverse)
df0 %>%
  pivot_wider(names_from = Method, values_from = c(Mg, Al, Ca, Ti)) %>%
  pivot_longer(cols = -Sample) %>%                   # wide to long data format
  separate(name, c("key", "number"), sep = "_") %>%  # split e.g. "Mg_A" into element and method
  group_by(number) %>%                               # group the values by method
  mutate(row = row_number()) %>%                     # create unique row IDs
  pivot_wider(names_from = number, values_from = value) %>%
  ggplot() +
  geom_point(aes(x = A, y = B, color = "A vs B")) +
  geom_point(aes(x = A, y = C, color = "A vs C")) +
  geom_abline(slope = 1, intercept = 0) +            # 1:1 reference line
  geom_smooth(aes(x = A, y = B, color = "A vs B"), method = lm, se = FALSE, fullrange = TRUE) +
  geom_smooth(aes(x = A, y = C, color = "A vs C"), method = lm, se = FALSE, fullrange = TRUE) +
  facet_wrap(key ~ ., scales = "free") +
  theme_bw() +
  ylab("B or C") +
  xlab("A")
Data
df0 = structure(list(Sample = c("Sa", "Sb", "Sc", "Sd", "Se", "Sa",
"Sb", "Sc", "Sd", "Se", "Sa", "Sb", "Sc", "Sd", "Se"), Method = c("A",
"A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C",
"C"), Mg = c(5.5, 4.2, 1.1, 3.3, 6.2, 5.2, 4.6, 1.6, 3.1, 6.8,
5.6, 4.1, 1, 3.2, 6.8), Al = c(2.2, 1.2, 0.5, 1.3, 0.2, 2, 1.3,
0.8, 1.6, 0.3, 2.5, 1.2, 0.6, 1.5, 0.1), Ca = c(33L, 44L, 25L,
31L, 55L, 35L, 48L, 22L, 29L, 51L, 30L, 41L, 22L, 30L, 51L),
Ti = c(0.2, 0.1, 0.3, 0.5, 0.6, 0.25, 0.1, 0.32, 0.4, 0.7,
0.2, 0.15, 0.4, 0.5, 0.65)), class = "data.frame", row.names = c(NA,
-15L))
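A more direct variant of the same idea, as a sketch: reshape to long format once, then join the Method A values back on as a reference column. The names element, value, ref, long and cmp are my own choices, not from the question, and df0 is rebuilt here from the question's table so that the snippet runs on its own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df0 <- read.table(text = "Sample Method Mg Al Ca Ti
Sa A 5.5 2.2 33 0.2
Sb A 4.2 1.2 44 0.1
Sc A 1.1 0.5 25 0.3
Sd A 3.3 1.3 31 0.5
Se A 6.2 0.2 55 0.6
Sa B 5.2 2 35 0.25
Sb B 4.6 1.3 48 0.1
Sc B 1.6 0.8 22 0.32
Sd B 3.1 1.6 29 0.4
Se B 6.8 0.3 51 0.7
Sa C 5.6 2.5 30 0.2
Sb C 4.1 1.2 41 0.15
Sc C 1 0.6 22 0.4
Sd C 3.2 1.5 30 0.5
Se C 6.8 0.1 51 0.65", header = TRUE, stringsAsFactors = FALSE)

# One row per Sample / Method / element
long <- pivot_longer(df0, cols = c(Mg, Al, Ca, Ti),
                     names_to = "element", values_to = "value")

# Method A values become the reference (x-axis) column
ref <- long %>%
  filter(Method == "A") %>%
  select(Sample, element, ref = value)

# Every other method is compared against the reference
cmp <- long %>%
  filter(Method != "A") %>%
  inner_join(ref, by = c("Sample", "element"))

p <- ggplot(cmp, aes(x = ref, y = value, colour = Method)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # 1:1 line
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +     # linear trend per method
  facet_wrap(~ element, scales = "free") +
  labs(x = "Method A", y = "Other method") +
  theme_bw()
print(p)
```

Because the method is mapped to colour instead of being hard-coded per layer, an extra Method D in the table should show up without changes to the plotting code.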
Is there a way in R to carry out an ANOVA test from a table of data that looks as follows:
Trees Avg_number_1m Avg_number_2m Avg_number_3m Avg_number_4m
1 Tree_1 15.2 15.0 15.2 12.0
2 Tree_2 16.2 15.4 14.2 15.4
3 Tree_3 14.4 9.2 3.2 1.6
4 Tree_4 14.6 5.6 10.4 9.2
5 Tree_5 15.2 13.0 7.4 3.0
6 Tree_6 14.0 12.0 13.0 11.2
7 Tree_7 13.8 7.8 7.2 2.0
8 Tree_8 10.8 5.8 4.4 2.4
9 Tree_9 12.4 9.6 6.8 2.6
10 Tree_10 15.6 11.0 7.2 1.8
11 Tree_11 7.6 7.4 9.0 1.8
12 Tree_12 13.8 7.8 7.2 2.0
13 Tree_13 10.8 5.8 4.4 1.6
14 Tree_14 15.2 15.0 15.2 12.0
15 Tree_15 16.2 15.4 14.2 15.0
16 Tree_16 12.4 9.2 3.2 1.6
17 Tree_17 14.6 5.6 10.4 9.2
18 Tree_18 15.2 13.0 7.4 3.0
19 Tree_19 14.0 14.4 13.2 13.8
20 Tree_20 11.0 5.2 4.4 0.8
I've tried to find tutorials on how to do this but the fact that the aov command requires one x and one y variable has been throwing me off. Any help is much appreciated.
So this is your data:
x = structure(list(Trees = structure(c(1L, 12L, 14L, 15L, 16L, 17L,
18L, 19L, 20L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L), .Label = c("Tree_1",
"Tree_10", "Tree_11", "Tree_12", "Tree_13", "Tree_14", "Tree_15",
"Tree_16", "Tree_17", "Tree_18", "Tree_19", "Tree_2", "Tree_20",
"Tree_3", "Tree_4", "Tree_5", "Tree_6", "Tree_7", "Tree_8", "Tree_9"
), class = "factor"), Avg_number_1m = c(15.2, 16.2, 14.4, 14.6,
15.2, 14, 13.8, 10.8, 12.4, 15.6, 7.6, 13.8, 10.8, 15.2, 16.2,
12.4, 14.6, 15.2, 14, 11), Avg_number_2m = c(15, 15.4, 9.2, 5.6,
13, 12, 7.8, 5.8, 9.6, 11, 7.4, 7.8, 5.8, 15, 15.4, 9.2, 5.6,
13, 14.4, 5.2), Avg_number_3m = c(15.2, 14.2, 3.2, 10.4, 7.4,
13, 7.2, 4.4, 6.8, 7.2, 9, 7.2, 4.4, 15.2, 14.2, 3.2, 10.4, 7.4,
13.2, 4.4), Avg_number_4m = c(12, 15.4, 1.6, 9.2, 3, 11.2, 2,
2.4, 2.6, 1.8, 1.8, 2, 1.6, 12, 15, 1.6, 9.2, 3, 13.8, 0.8)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))
We can very quickly visualize your data using boxplot, and it shows that there are fewer spines at greater heights:
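The plot itself is not reproduced here. A minimal base-R sketch that would give a boxplot of this kind is below. It uses a small stand-in data frame (the first five trees from the table) called x_small, which is my own name, so that it runs on its own; with the full data you would pass x[, -1] instead.

```r
# Stand-in data: first five trees of the question's table
x_small <- data.frame(
  Trees = paste0("Tree_", 1:5),
  Avg_number_1m = c(15.2, 16.2, 14.4, 14.6, 15.2),
  Avg_number_2m = c(15.0, 15.4, 9.2, 5.6, 13.0),
  Avg_number_3m = c(15.2, 14.2, 3.2, 10.4, 7.4),
  Avg_number_4m = c(12.0, 15.4, 1.6, 9.2, 3.0)
)

# One box per height column; drop the Trees identifier column first
bp <- boxplot(x_small[, -1], ylab = "Average number", las = 2)

# Medians per height, read from the boxplot statistics (row 3 of bp$stats)
setNames(bp$stats[3, ], bp$names)
```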
So we load a few libraries to get the data in the correct shape:
library(ggplot2)
library(tidyr)
# first we make it a "long" format
df = pivot_longer(x,-Trees,names_to="Height_levels")
Now we visualize how the values change with height for each individual tree:
ggplot(df,aes(x=Height_levels,y=value,col=Trees)) + geom_point() +
geom_line(aes(group=Trees)) + theme(legend.position="top")
This tells us two things: we need to adjust for the tree, and then test whether there are differences between the heights. The most straightforward way is to use an ANOVA:
aovfit = aov(value ~ Trees + Height_levels,data=df)
summary(aovfit)
Df Sum Sq Mean Sq F value Pr(>F)
Trees 19 877.9 46.20 7.692 8.98e-10 ***
Height_levels 3 588.9 196.31 32.682 2.02e-12 ***
Residuals 57 342.4 6.01
And post-hoc with Tukey:
posthoc = TukeyHSD(aovfit)
posthoc$Height_levels
diff lwr upr p adj
Avg_number_2m-Avg_number_1m -3.49 -5.54109 -1.4389103 1.930647e-04
Avg_number_3m-Avg_number_1m -4.77 -6.82109 -2.7189103 4.752523e-07
Avg_number_4m-Avg_number_1m -7.55 -9.60109 -5.4989103 1.182687e-11
Avg_number_3m-Avg_number_2m -1.28 -3.33109 0.7710897 3.586375e-01
Avg_number_4m-Avg_number_2m -4.06 -6.11109 -2.0089103 1.429319e-05
Avg_number_4m-Avg_number_3m -2.78 -4.83109 -0.7289103 3.779450e-03
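Adjusting for the tree can also be written as a repeated-measures style model with an Error() term in aov. The sketch below, on a five-tree stand-in data set (x_small and df_small are my own names), checks that for a balanced layout like this one the F statistic for height comes out the same either way. Note that TukeyHSD has no method for the Error() form, so the additive model is still the one to use for the post-hoc step.

```r
# Stand-in data: first five trees of the question's table
x_small <- data.frame(
  Trees = factor(paste0("Tree_", 1:5)),
  Avg_number_1m = c(15.2, 16.2, 14.4, 14.6, 15.2),
  Avg_number_2m = c(15.0, 15.4, 9.2, 5.6, 13.0),
  Avg_number_3m = c(15.2, 14.2, 3.2, 10.4, 7.4),
  Avg_number_4m = c(12.0, 15.4, 1.6, 9.2, 3.0)
)

# Base-R reshape to long format: one row per tree and height level
df_small <- data.frame(
  Trees = rep(x_small$Trees, times = 4),
  Height_levels = factor(rep(names(x_small)[-1], each = nrow(x_small))),
  value = unlist(x_small[, -1], use.names = FALSE)
)

# Tree as an ordinary blocking factor
fit_add <- aov(value ~ Trees + Height_levels, data = df_small)

# Tree as an error stratum (repeated measures on each tree)
fit_err <- aov(value ~ Height_levels + Error(Trees), data = df_small)

f_add <- summary(fit_add)[[1]][["F value"]][2]
f_err <- summary(fit_err)[["Error: Within"]][[1]][["F value"]][1]
c(additive = f_add, error_term = f_err)
```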
If you would like, you can also fit a linear model, where the height is a continuous variable, and test it with an anova:
df$Height = as.numeric(gsub("[^0-9]","",as.character(df$Height_levels)))
aov_continuous = aov(value ~ Trees + Height,data=df)
summary(aov_continuous)
Df Sum Sq Mean Sq F value Pr(>F)
Trees 19 877.9 46.2 7.601 7.74e-10 ***
Height 1 572.6 572.6 94.199 7.78e-14 ***
Residuals 59 358.7 6.1
The coefficients tell you how many fewer spines you get on average when going up 1 m. In this case, the change is about -2.39 per metre:
aov_continuous$coefficients
[...]
Height
-2.393000e+00
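The gsub step above strips everything that is not a digit from the level names, so "Avg_number_3m" becomes 3. Here is a small sketch of that conversion, and of reading the slope off an equivalent lm fit, again on a five-tree stand-in (levels_chr, df_small and fit are my own names, and the numbers it prints are for the stand-in, not the full data):

```r
# Digits-only extraction: "Avg_number_1m" -> 1, and so on
levels_chr <- c("Avg_number_1m", "Avg_number_2m", "Avg_number_3m", "Avg_number_4m")
height <- as.numeric(gsub("[^0-9]", "", levels_chr))
height

# Stand-in data: first five trees, one block of five values per height
df_small <- data.frame(
  Trees = factor(rep(paste0("Tree_", 1:5), times = 4)),
  Height = rep(height, each = 5),
  value = c(15.2, 16.2, 14.4, 14.6, 15.2,
            15.0, 15.4, 9.2, 5.6, 13.0,
            15.2, 14.2, 3.2, 10.4, 7.4,
            12.0, 15.4, 1.6, 9.2, 3.0)
)

fit <- lm(value ~ Trees + Height, data = df_small)
coef(fit)["Height"]       # change in value per extra metre
confint(fit)["Height", ]  # 95% interval for that slope
```

Using coef(fit)["Height"] picks out the slope by name, which avoids scrolling past all the per-tree coefficients.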
I have a character field in a dataframe that contains numbers e.g. (0.5,3.5,7.8,2.4).
For every record I am trying to extract the largest value from the string and put it in a new column.
e.g.
x csi
1 0.5, 6.7, 2.3
2 9.5, 2.6, 1.1
3 0.7, 2.3, 5.1
4 4.1, 2.7, 4.7
The desired output would be:
x csi csi_max
1 0.5, 6.7, 2.3 6.7
2 9.5, 2.6, 1.1 9.5
3 0.7, 2.3, 5.1 5.1
4 4.1, 2.7, 4.7 4.7
I have had various attempts ...with my latest attempt being the following - which provides the maximum csi score from the entire column rather than from the individual row's csi numbers...
library(stringr)
numextract <- function(string){
str_extract(string, "\\-*\\d+\\.*\\d*")
}
df$max_csi <- max(numextract(df$csi))
Thank you
We can use tidyverse
library(dplyr)
library(tidyr)
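df1 %>%
  separate_rows(csi, convert = TRUE) %>%  # convert = TRUE makes csi numeric, so max() compares numbers, not strings
  group_by(x) %>%
  summarise(csi_max = max(csi)) %>%
  left_join(df1, ., by = "x")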
# x csi csi_max
#1 1 0.5, 6.7, 2.3 6.7
#2 2 9.5, 2.6, 1.1 9.5
#3 3 0.7, 2.3, 5.1 5.1
#4 4 4.1, 2.7, 4.7 4.7
Or this can be done with pmax from base R after separating the 'csi' column into a data.frame with read.table
df1$csi_max <- do.call(pmax, read.table(text=df1$csi, sep=","))
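To unpack that one-liner: read.table parses each string into one row with a numeric column per value, and do.call(pmax, ...) then takes the maximum across those columns row by row. A sketch with the question's values (csi, parsed and csi_max are my own names); it assumes csi is a character vector and that every row has the same number of values.

```r
csi <- c("0.5, 6.7, 2.3", "9.5, 2.6, 1.1", "0.7, 2.3, 5.1", "4.1, 2.7, 4.7")

# Each string becomes one row; the comma-separated values become columns V1..V3
parsed <- read.table(text = csi, sep = ",")
str(parsed)

# pmax() is the element-wise ("parallel") maximum; do.call passes each column as an argument
csi_max <- do.call(pmax, parsed)
csi_max
```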
Hope this helps!
df$csi_max <- sapply(df$csi, function(x) max(as.numeric(unlist(strsplit(as.character(x), split=",")))))
Output is:
x csi csi_max
1 1 0.5, 6.7, 2.3 6.7
2 2 9.5, 2.6, 1.1 9.5
3 3 0.7, 2.3, 5.1 5.1
4 4 4.1, 2.7, 4.7 4.7
#sample data
> dput(df)
structure(list(x = 1:4, csi = structure(c(1L, 4L, 2L, 3L), .Label = c("0.5, 6.7, 2.3",
"0.7, 2.3, 5.1", "4.1, 2.7, 4.7", "9.5, 2.6, 1.1"), class = "factor")), .Names = c("x",
"csi"), class = "data.frame", row.names = c(NA, -4L))
Edit:
As suggested by @RichScriven, a more efficient way could be
df$csi_max <- sapply(strsplit(as.character(df$csi), ","), function(x) max(as.numeric(x)))
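One detail worth checking in any of these approaches is that the pieces are converted to numeric before max is taken, because max on character strings compares them as text. A small illustration with made-up values (not from the question), where the two give different answers:

```r
csi <- c("0.5, 6.7, 2.3", "10.2, 9.5, 1.1")  # made-up values for illustration
parts <- strsplit(csi, ",")

# Comparing as text: "9.5" sorts after "10.2" because "9" > "1"
max_chr <- sapply(parts, function(p) max(trimws(p)))

# Comparing as numbers: as.numeric() ignores the leading spaces
max_num <- sapply(parts, function(p) max(as.numeric(p)))

max_chr
max_num
```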
A solution using the splitstackshape package.
library(splitstackshape)
dat$csi_max <- apply(cSplit(dat, "csi")[, -1], 1, max)
dat
# x csi csi_max
# 1 1 0.5, 6.7, 2.3 6.7
# 2 2 9.5, 2.6, 1.1 9.5
# 3 3 0.7, 2.3, 5.1 5.1
# 4 4 4.1, 2.7, 4.7 4.7
DATA
dat <- read.table(text = "x csi
1 '0.5, 6.7, 2.3'
2 '9.5, 2.6, 1.1'
3 '0.7, 2.3, 5.1'
4 '4.1, 2.7, 4.7'",
header = TRUE, stringsAsFactors = FALSE)
I have a sample data set
ID Depth Salinity Temperature Time fluorescence
1 0 1.3 29.2 13:44:23 152
2 3.1 1.4 29.2 13:44:26 175
3 3.5 2 29.2 13:44:30 149
4 4.3 2.6 29.2 13:44:34 192
5 7.5 2.9 29.4 13:44:37 174
6 8.2 2.1 29.1 13:44:41 154
7 10 2.6 29.1 13:44:44 147
8 9.1 2.6 29.1 13:44:48 150
9 7.3 2.7 28.9 13:44:52 147
10 5.2 3.2 29.0 13:44:55 180
11 4.5 2 29.0 13:44:59 167
12 3.3 2.3 29.1 13:45:03 154
13 2.5 1.8 29.1 13:45:06 106
14 0 1.5 29.1 13:45:10 136
I want two profiles, an up profile and a down profile (i.e. depth 0-10 and 10-0), in the same plot. I used the code below to generate a plot:
meltdf <- mutate(meltdf, trend = c(rep("UP",7), rep("DOWN",7)))
p <- ggplot(meltdf, aes(x = Temperature, y = Depth, color = trend)) +
  geom_line()
p
This gives me a plot. However, what I want is Depth on the y-axis and Salinity, Temperature, and fluorescence on multiple x-axes in the same graph. As they have different ranges, I don't know how I should set this up.
Also, my real data set is quite big, and when I plot it I don't get a smooth curve. Is there a way to avoid those spikes?
You might be looking for something like this
Your data
df <- structure(list(ID = 1:14, Depth = c(0, 3.1, 3.5, 4.3, 7.5, 8.2,
10, 9.1, 7.3, 5.2, 4.5, 3.3, 2.5, 0), Salinity = c(1.3, 1.4,
2, 2.6, 2.9, 2.1, 2.6, 2.6, 2.7, 3.2, 2, 2.3, 1.8, 1.5), Temperature = c(29.2,
29.2, 29.2, 29.2, 29.4, 29.1, 29.1, 29.1, 28.9, 29, 29, 29.1,
29.1, 29.1), Time = c("13:44:23", "13:44:26", "13:44:30", "13:44:34",
"13:44:37", "13:44:41", "13:44:44", "13:44:48", "13:44:52", "13:44:55",
"13:44:59", "13:45:03", "13:45:06", "13:45:10"), fluorescence = c(152L,
175L, 149L, 192L, 174L, 154L, 147L, 150L, 147L, 180L, 167L, 154L,
106L, 136L)), .Names = c("ID", "Depth", "Salinity", "Temperature",
"Time", "fluorescence"), row.names = c(NA, -14L), class = c("data.table",
"data.frame"))
library(tidyverse)
meltdf <- mutate(df, trend = c(rep("UP",7), rep("DOWN",7)))
solution
Starting with meltdf, gather relevant x-axis variables
moremelt <- meltdf %>%
gather(key, value, Salinity, Temperature, fluorescence)
ggplot with facet_wrap, using the options nrow = 3 and scales = "free"
ggplot(moremelt, aes(x = value, y = Depth, color = interaction(trend, key), label = key)) +
  geom_line(lwd = 2) +
  scale_colour_manual(values = c("orange", "red", "blue", "cyan", "black", "grey")) +
  facet_wrap(~key, nrow = 3, scales = "free")
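On the spikes: one possible cause, which I can't confirm without the full data, is that geom_line connects points in order of the x variable, not in the order they were recorded. geom_path connects them in row order, which is usually what a depth profile needs. A sketch using the sample data, with pivot_longer in place of gather and a reversed depth axis (both my own choices, as are the names long and p):

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  ID = 1:14,
  Depth = c(0, 3.1, 3.5, 4.3, 7.5, 8.2, 10, 9.1, 7.3, 5.2, 4.5, 3.3, 2.5, 0),
  Salinity = c(1.3, 1.4, 2, 2.6, 2.9, 2.1, 2.6, 2.6, 2.7, 3.2, 2, 2.3, 1.8, 1.5),
  Temperature = c(29.2, 29.2, 29.2, 29.2, 29.4, 29.1, 29.1, 29.1, 28.9, 29,
                  29, 29.1, 29.1, 29.1),
  fluorescence = c(152, 175, 149, 192, 174, 154, 147, 150, 147, 180, 167, 154,
                   106, 136)
)
df$trend <- rep(c("UP", "DOWN"), each = 7)

# Long format: one row per measurement and variable, still in recording order
long <- pivot_longer(df, cols = c(Salinity, Temperature, fluorescence),
                     names_to = "key", values_to = "value")

p <- ggplot(long, aes(x = value, y = Depth, colour = trend)) +
  geom_path() +                  # connect points in row (time) order
  scale_y_reverse() +            # depth increases downwards
  facet_wrap(~ key, nrow = 1, scales = "free_x")
print(p)
```

With scales = "free_x" each variable gets its own x range while the panels share the depth axis, which is the closest ggplot2 comes to "multiple x-axes" without overlaying differently scaled series.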
My data frame is simple (and probably is not strictly a dataframe):
date MAE_f0 MAE_f1
1 20140101 0.2 0.2
2 20140102 1.9 0.1
3 20140103 0.1 0.3
4 20140104 7.8 15.9
5 20140105 1.9 4.6
6 20140106 0.8 0.8
7 20140107 0.5 0.6
8 20140108 0.2 0.2
9 20140109 0.2 0.2
10 20140110 0.8 1.1
11 20140111 0.2 0.2
12 20140112 0.4 0.4
13 20140113 2.8 0.9
14 20140114 5.4 5.8
15 20140115 0.2 0.3
16 20140116 4.9 3.1
17 20140117 3.7 6.0
18 20140118 1.4 2.1
19 20140119 0.9 3.0
20 20140120 0.2 3.6
21 20140121 0.3 0.3
22 20140122 0.4 0.4
23 20140123 0.6 1.7
24 20140124 6.1 4.7
25 20140125 0.1 0.0
26 20140126 7.4 4.9
27 20140127 0.8 0.9
28 20140128 0.3 0.3
29 20140129 3.0 4.2
30 20140130 9.9 17.3
On every day I've 2 variables: MAE for f0, and MAE for f1.
I can calculate frequency for my 2 variables on the whole time period using "cut" with the same intervals for both:
cut(mae.df$MAE_f0,c(0,2,5,10,50))
cut(mae.df$MAE_f1,c(0,2,5,10,50))
Now I can use boxplot to plot each variable against its frequency distribution:
boxplot(mae.df$MAE_f0~cut(mae.df$MAE_f0,c(0,2,5,10,50)))
boxplot(mae.df$MAE_f1~cut(mae.df$MAE_f1,c(0,2,5,10,50)))
The two resulting boxplots are very simple (I can't show them because I have no "reputation"): on x there are the frequency intervals (0-2, 2-5, 5-10, 10-50), and on y the boxplot values of the variable (e.g. MAE_f0) for each interval.
The question is quite simple: I'd like to have a single plot with both variables, MAE_f0 and MAE_f1, and their frequency distributions, i.e. 2 boxplots for each frequency interval (2 for 0-2, 2 for 2-5, and so on).
I know that my knowledge of R and data frames is very poor and that I'm missing something important here, especially about data frames and reshaping. Sorry in advance for that! I've seen some nice examples on Stack Overflow about grouped boxplots, all without a time variable, but I can't figure out how to adjust my data frame to do that.
I hope my question is not misplaced: sorry again for that.
Umbe
Here is how I would do this. I think it makes sense to melt your data first. A quick tutorial on melting your data is available here.
# First, make this reproducible by using dput for the data frame
df <- structure(list(date = 20140101:20140130, MAE_f0 = c(0.2, 1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4, 0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8, 0.3, 3, 9.9), MAE_f1 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2, 0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3, 0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)), .Names = c("date", "MAE_f0", "MAE_f1"), row.names = c(NA, -30L), class = "data.frame")
require(ggplot2)
require(reshape2)
# Melt the original data frame
df2 <- melt(df, measure.vars = c("MAE_f0", "MAE_f1"))
head(df2)
# date variable value
# 1 20140101 MAE_f0 0.2
# 2 20140102 MAE_f0 1.9
# 3 20140103 MAE_f0 0.1
# 4 20140104 MAE_f0 7.8
# 5 20140105 MAE_f0 1.9
# 6 20140106 MAE_f0 0.8
# Create a "cuts" variable with the correct breaks
df2$cuts <- cut(df2$value,
breaks = c(-Inf, 2, 5, 10, +Inf),
labels = c("first cut", "second cut", "third cut", "fourth cut"))
head(df2)
# date variable value cuts
# 1 20140101 MAE_f0 0.2 first cut
# 2 20140102 MAE_f0 1.9 first cut
# 3 20140103 MAE_f0 0.1 first cut
# 4 20140104 MAE_f0 7.8 third cut
# 5 20140105 MAE_f0 1.9 first cut
# 6 20140106 MAE_f0 0.8 first cut
# Plotting
ggplot(df2, aes(x = variable, y = value, fill = variable)) +
geom_boxplot() +
facet_wrap(~ cuts, nrow = 1)
Result:
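A note on the breaks: with the question's original breaks c(0, 2, 5, 10, 50), cut uses right-closed intervals by default, so a value of exactly 0 (MAE_f1 has one, on 20140125) falls outside (0,2] and becomes NA. Starting the breaks at -Inf, as in the code above, or passing include.lowest = TRUE avoids that. A quick check on a few values (vals and the result names are my own):

```r
vals <- c(0, 0.2, 2, 4.6, 17.3)

# Default: intervals are (0,2], (2,5], ... so 0 itself is not covered
cut_default <- cut(vals, breaks = c(0, 2, 5, 10, 50))

# include.lowest = TRUE closes the first interval on the left: [0,2]
cut_lowest <- cut(vals, breaks = c(0, 2, 5, 10, 50), include.lowest = TRUE)

# Open-ended breaks catch everything below 2 and above 10
cut_inf <- cut(vals, breaks = c(-Inf, 2, 5, 10, Inf))

data.frame(vals, cut_default, cut_lowest, cut_inf)
```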
Here is one way. Reshape your data, then add a fake data point: I noticed that there is no data point for MAE_f0 in the interval (10,50] (frequency 10-50). Combine your reshaped data with the fake data. When you draw the figure, use coord_cartesian with the range of y values in the original data set, so that the fake point stays out of view. Hope this gives you the graphic you want. Here, your data is called mydf.
library(dplyr)
library(tidyr)
library(ggplot2)
mydf <- structure(list(V1 = 1:30, V2 = 20140101:20140130, V3 = c(0.2,
1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4,
0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8,
0.3, 3, 9.9), V4 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2,
0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3,
0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)), .Names = c("V1",
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -30L
))
ana <- select(mydf, -V1) %>%
rename(date = V2, MAE_f0 = V3, MAE_f1 = V4) %>%
gather(variable, value, -date) %>%
mutate(frequency = cut(value, breaks = c(-Inf,2,5,10,50)))
# Create a fake df
extra <- data.frame(date = 20140101,
variable = "MAE_f0",
value = 60,
frequency = "(10,50]")
new <- rbind(ana, extra)
ggplot(data = new, aes(x = frequency, y = value, fill = variable)) +
geom_boxplot(position = "dodge") +
coord_cartesian(ylim = range(ana$value) + c(-0.25, 0.25))
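An alternative to the fake data point, offered as a sketch and not a drop-in replacement: position_dodge(preserve = "single") keeps every box the same width even when one group is missing from an interval. This version uses pivot_longer for the reshaping, and the object names (df, long, p) are my own.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  date = 20140101:20140130,
  MAE_f0 = c(0.2, 1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8,
             5.4, 0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4,
             0.8, 0.3, 3, 9.9),
  MAE_f1 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2, 0.2, 1.1, 0.2, 0.4, 0.9,
             5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3, 0.4, 1.7, 4.7, 0, 4.9, 0.9,
             0.3, 4.2, 17.3)
)

# Long format, then bin every value with the same breaks
long <- pivot_longer(df, cols = c(MAE_f0, MAE_f1),
                     names_to = "variable", values_to = "value")
long$frequency <- cut(long$value, breaks = c(-Inf, 2, 5, 10, 50))

# MAE_f0 has no values in (10,50]; preserve = "single" keeps the lone box narrow
p <- ggplot(long, aes(x = frequency, y = value, fill = variable)) +
  geom_boxplot(position = position_dodge(preserve = "single"))
print(p)
```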