R: Make scatter plots (ggplot) from columns based on attributes from rows

I have the following type of table :
df0 <- read.table(text = 'Sample Method Mg Al Ca Ti
Sa A 5.5 2.2 33 0.2
Sb A 4.2 1.2 44 0.1
Sc A 1.1 0.5 25 0.3
Sd A 3.3 1.3 31 0.5
Se A 6.2 0.2 55 0.6
Sa B 5.2 2 35 0.25
Sb B 4.6 1.3 48 0.1
Sc B 1.6 0.8 22 0.32
Sd B 3.1 1.6 29 0.4
Se B 6.8 0.3 51 0.7
Sa C 5.6 2.5 30 0.2
Sb C 4.1 1.2 41 0.15
Sc C 1 0.6 22 0.4
Sd C 3.2 1.5 30 0.5
Se C 6.8 0.1 51 0.65', header = T, stringsAsFactors = F)
The table contains chemical compositions. I would like to use Method A as the reference (x-axis) and automatically produce scatter plots with the data from Methods B and C on the y-axis, each with a linear trend line, plus a 1:1 reference line that corresponds to a perfect match.
In other words, I would like to produce plots like this:
I think a solution could start by transforming the data frame into:
df <- read.table(text = 'Sample Mg_A Al_A Ca_A Ti_A Mg_B Al_B Ca_B Ti_B Mg_C Al_C Ca_C Ti_C
Sa 5.5 2.2 33 0.2 5.2 2 35 0.25 5.6 2.5 30 0.2
Sb 4.2 1.2 44 0.1 4.6 1.3 48 0.1 4.1 1.2 41 0.15
Sc 1.1 0.5 25 0.3 1.6 0.8 22 0.32 1 0.6 22 0.4
Sd 3.3 1.3 31 0.5 3.1 1.6 29 0.4 3.2 1.5 30 0.5
Se 6.2 0.2 55 0.6 6.8 0.3 51 0.7 6.8 0.1 51 0.65
', header = T, stringsAsFactors = F)
But I don't know how to go further.
Any help would be appreciated.
Best, Anne-Christine

You can use the following code
library(tidyverse)
df0 %>%
  pivot_wider(names_from = Method, values_from = c(Mg, Al, Ca, Ti)) %>%
  pivot_longer(cols = -Sample) %>% # wide to long data format
  separate(name, c("key", "number"), sep = "_") %>%
  group_by(number) %>% # group the values according to number (the method)
  mutate(row = row_number()) %>% # create unique IDs
  pivot_wider(names_from = number, values_from = value) %>%
  ggplot() +
  geom_point(aes(x = A, y = B, color = "A vs B")) +
  geom_point(aes(x = A, y = C, color = "A vs C")) +
  geom_abline(slope = 1, intercept = 0) + # 1:1 reference line
  geom_smooth(aes(x = A, y = B, color = "A vs B"), method = lm, se = FALSE, fullrange = TRUE) +
  geom_smooth(aes(x = A, y = C, color = "A vs C"), method = lm, se = FALSE, fullrange = TRUE) +
  facet_wrap(key ~ ., scales = "free") +
  theme_bw() +
  ylab("B or C") +
  xlab("A")
Data
df0 = structure(list(Sample = c("Sa", "Sb", "Sc", "Sd", "Se", "Sa",
"Sb", "Sc", "Sd", "Se", "Sa", "Sb", "Sc", "Sd", "Se"), Method = c("A",
"A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C",
"C"), Mg = c(5.5, 4.2, 1.1, 3.3, 6.2, 5.2, 4.6, 1.6, 3.1, 6.8,
5.6, 4.1, 1, 3.2, 6.8), Al = c(2.2, 1.2, 0.5, 1.3, 0.2, 2, 1.3,
0.8, 1.6, 0.3, 2.5, 1.2, 0.6, 1.5, 0.1), Ca = c(33L, 44L, 25L,
31L, 55L, 35L, 48L, 22L, 29L, 51L, 30L, 41L, 22L, 30L, 51L),
Ti = c(0.2, 0.1, 0.3, 0.5, 0.6, 0.25, 0.1, 0.32, 0.4, 0.7,
0.2, 0.15, 0.4, 0.5, 0.65)), class = "data.frame", row.names = c(NA,
-15L))
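As a rough alternative sketch (not the answer's method): the same pairing can be done by joining the Method A rows back onto the B and C rows, which gives one long table with a single y column and a Method column to colour by. The names long, ref and cmp are made up for illustration, and the reshaping uses base R only; the plotting part assumes ggplot2 is installed.

```r
# Data from the question
df0 <- read.table(text = "Sample Method Mg Al Ca Ti
Sa A 5.5 2.2 33 0.2
Sb A 4.2 1.2 44 0.1
Sc A 1.1 0.5 25 0.3
Sd A 3.3 1.3 31 0.5
Se A 6.2 0.2 55 0.6
Sa B 5.2 2 35 0.25
Sb B 4.6 1.3 48 0.1
Sc B 1.6 0.8 22 0.32
Sd B 3.1 1.6 29 0.4
Se B 6.8 0.3 51 0.7
Sa C 5.6 2.5 30 0.2
Sb C 4.1 1.2 41 0.15
Sc C 1 0.6 22 0.4
Sd C 3.2 1.5 30 0.5
Se C 6.8 0.1 51 0.65", header = TRUE, stringsAsFactors = FALSE)

elements <- c("Mg", "Al", "Ca", "Ti")

# Stack the element columns: one row per Sample x Method x Element
long <- do.call(rbind, lapply(elements, function(el) {
  data.frame(Sample = df0$Sample, Method = df0$Method,
             Element = el, value = as.numeric(df0[[el]]))
}))

# Method A becomes the reference column "A"
ref <- long[long$Method == "A", c("Sample", "Element", "value")]
names(ref)[3] <- "A"

# Pair every B and C measurement with its Method A reference
cmp <- merge(long[long$Method != "A", ], ref, by = c("Sample", "Element"))

# Plot only if ggplot2 is available (assumption: it is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cmp, aes(x = A, y = value, colour = Method)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # 1:1 line
    facet_wrap(~ Element, scales = "free") +
    labs(x = "Method A", y = "Method B or C")
  print(p)
}
```

With this layout, adding a Method D later would not need any extra geom_point or geom_smooth calls, since the colour mapping picks up every non-reference method.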

Related

Replacing certain values in a column with words

My goal is to replace the numeric values in a specific column with words, based on the range each value falls in, for use in a later categorical test. I'm trying to change the data frame below (let's call it DF):
SubjectID  ColumnA  ColumnB  ColumnC
Subject1   38       2.3      2.1
Subject2   12       2.1      2.0
Subject3   1        1.1      1.9
Subject4   34       3.2      1.5
Subject5   1        1.7      1.5
Subject6   56       3.9      1.7
To achieve a dataframe such as the one here:
SubjectID  ColumnA  ColumnB  ColumnC
Subject1   Mid      2.3      2.1
Subject2   Low      2.1      2.0
Subject3   Low      1.1      1.9
Subject4   Mid      3.2      1.5
Subject5   Low      1.7      1.5
Subject6   High     3.9      1.7
So in this case, I only want to change ColumnA's values, based on the range each value lies in.
For this example:
A value of Low represents a value lower than 30.
A value of Mid represents a value between 30 and 50
A value of High represents a value higher than 50
What would be the best way to do this?
We could use case_when
library(dplyr)
DF <- DF %>%
  mutate(ColumnA = case_when(ColumnA < 30 ~ "Low",
                             between(ColumnA, 30, 50) ~ "Mid",
                             TRUE ~ "High"))
DF
SubjectID ColumnA ColumnB ColumnC
1 Subject1 Mid 2.3 2.1
2 Subject2 Low 2.1 2.0
3 Subject3 Low 1.1 1.9
4 Subject4 Mid 3.2 1.5
5 Subject5 Low 1.7 1.5
6 Subject6 High 3.9 1.7
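Another convenient option that avoids multiple expressions is cut from base R. Note that cut uses right-closed intervals by default, so a value of exactly 30 would be labelled "Low"; pass right = FALSE if 30 should count as "Mid" (in which case exactly 50 becomes "High"). For this data no value sits on a break, so the result is the same either way.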
cut(DF$ColumnA, breaks = c(-Inf, 30, 50, Inf), labels = c("Low", "Mid", "High"))
[1] Mid Low Low Mid Low High
Levels: Low Mid High
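To see how cut treats values that sit exactly on a break, here is a small sketch with made-up test values (x, labs, right_closed and left_closed are illustrative names, not from the question):

```r
# Hypothetical values chosen to sit on and around the breaks
x <- c(29, 30, 50, 51)
labs <- c("Low", "Mid", "High")

# Default: intervals are closed on the right, i.e. (-Inf,30], (30,50], (50,Inf]
right_closed <- cut(x, breaks = c(-Inf, 30, 50, Inf), labels = labs)

# right = FALSE: intervals are closed on the left, i.e. [-Inf,30), [30,50), [50,Inf)
left_closed <- cut(x, breaks = c(-Inf, 30, 50, Inf), labels = labs, right = FALSE)

data.frame(x, right_closed, left_closed)
# right_closed: Low Low Mid High
# left_closed:  Low Mid High High
```

Neither setting reproduces "30 to 50 inclusive is Mid" exactly, so if both endpoints matter the case_when or ifelse versions are probably the safer choice.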
data
DF <- structure(list(SubjectID = c("Subject1", "Subject2", "Subject3",
"Subject4", "Subject5", "Subject6"), ColumnA = c(38L, 12L, 1L,
34L, 1L, 56L), ColumnB = c(2.3, 2.1, 1.1, 3.2, 1.7, 3.9), ColumnC = c(2.1,
2, 1.9, 1.5, 1.5, 1.7)), class = "data.frame", row.names = c(NA,
-6L))
If you prefer a base R solution you can use nested ifelse:
DF$ColumnA <- ifelse(DF$ColumnA < 30, "Low",
ifelse(DF$ColumnA >= 30 & DF$ColumnA <= 50, "Mid", "High"))
Result:
DF
SubjectID ColumnA ColumnB ColumnC
1 Subject1 Mid 2.3 2.1
2 Subject2 Low 2.1 2.0
3 Subject3 Low 1.1 1.9
4 Subject4 Mid 3.2 1.5
5 Subject5 Low 1.7 1.5
6 Subject6 High 3.9 1.7

How can I plot multiple columns under X and Y in ggplot2

data <- structure(list(A_w = c(0, 0.69, 1.41, 2.89, 6.42, 13.3, 25.5,
36.7, 44.3, 46.4), E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4,
4.4, 10.6, 16.5), A_e = c(0, 0.18, 0.37, 0.79, 1.93, 4.82, 11.4,
21.6, 31.1, 36.2), E_e = c(99.4, 99.3, 98.9, 98.4, 97.1, 93.3,
84.7, 71.5, 58.1, 48.7)), row.names = c(NA, -10L), class = "data.frame")
data
#> A_w E_w A_e E_e
#> 1 0.00 1.2 0.00 99.4
#> 2 0.69 1.2 0.18 99.3
#> 3 1.41 1.5 0.37 98.9
#> 4 2.89 1.6 0.79 98.4
#> 5 6.42 1.9 1.93 97.1
#> 6 13.30 2.3 4.82 93.3
#> 7 25.50 3.4 11.40 84.7
#> 8 36.70 4.4 21.60 71.5
#> 9 44.30 10.6 31.10 58.1
#> 10 46.40 16.5 36.20 48.7
Created on 2021-05-31 by the reprex package (v2.0.0)
I am trying to plot this data with all A values as X and Es as Y. How can I put either a) both of these columns plotted on a ggplot2, or b) rearrange this dataframe to combine the A columns and E columns into a final dataframe with only two columns with 2x as many rows as pictured?
Thanks for any help, I am a beginner (obviously)
Edit for Clarity: It's important that the A_e & E_e values remain as pairs, similar to how the A_w and E_w values remain as pairs. The end result plot should resemble the ORANGE and BLUE lines of this image, but I am trying to replicate this while learning R.
Currently I am able to plot each pair separately by splitting the data into two data frames of 10 rows x 2 columns:
A_w E_w
1 0.00 1.2
2 0.69 1.2
3 1.41 1.5
4 2.89 1.6
5 6.42 1.9
6 13.30 2.3
7 25.50 3.4
8 36.70 4.4
9 44.30 10.6
10 46.40 16.5
and the second plot
# A tibble: 10 x 2
A_e E_e
<dbl> <dbl>
1 0 99.4
2 0.18 99.3
3 0.37 98.9
4 0.79 98.4
5 1.93 97.1
6 4.82 93.3
7 11.4 84.7
8 21.6 71.5
9 31.1 58.1
10 36.2 48.7
But my end goal is to have them both on the same plot, like in the Excel graph (orange + blue graph) above.
Here is a try
library(dplyr)
library(ggplot2)
# Segment data for the first pair: each point plus the next point as its end
line_1_data <- data %>%
  select(A_w, E_w) %>%
  mutate(xend = lead(A_w), yend = lead(E_w)) %>%
  filter(!is.na(xend))
# Same for the second pair
line_2_data <- data %>%
  select(A_e, E_e) %>%
  mutate(xend = lead(A_e), yend = lead(E_e)) %>%
  filter(!is.na(xend))
# Several column pairs drawn with different geoms
ggplot(data = data) +
  # The blue line
  geom_point(aes(x = A_w, y = E_w), color = "blue") +
  geom_curve(data = line_1_data,
             aes(x = A_w, y = E_w, xend = xend, yend = yend),
             color = "blue", curvature = 0.02) +
  # The orange line
  geom_point(aes(x = A_e, y = E_e), color = "orange") +
  geom_curve(data = line_2_data,
             aes(x = A_e, y = E_e, xend = xend, yend = yend),
             color = "orange", curvature = -0.02) +
  # The red connection between the last points of the two lines
  geom_curve(data = tail(data, 1),
             aes(x = A_w, y = E_w, xend = A_e, yend = E_e),
             curvature = 0.1, color = "red") +
  # The black straight lines between each pair
  geom_curve(aes(x = A_w, y = E_w, xend = A_e, yend = E_e),
             curvature = 0, color = "black")
Created on 2021-05-31 by the reprex package (v2.0.0)
You can start from this:
data <- data.frame(
  A_w = c(0, 0.69, 1.41, 2.89, 6.42, 13.3, 25.5, 36.7, 44.3, 46.4),
  E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4, 4.4, 10.6, 16.5),
  A_e = c(0, 0.18, 0.37, 0.79, 1.93, 4.82, 11.4, 21.6, 31.1, 36.2),
  E_e = c(99.4, 99.3, 98.9, 98.4, 97.1, 93.3, 84.7, 71.5, 58.1, 48.7)
)
library(tidyverse)
data %>%
  # ".value" keeps A and E as columns; the suffix (w or e) goes into "type"
  pivot_longer(everything(), names_sep = '_', names_to = c('.value', 'type')) %>%
  ggplot(aes(x = A, y = E, color = type)) +
  geom_point() +
  geom_line()
Created on 2021-05-31 by the reprex package (v2.0.0)
Doing it "by hand":
# dummy data:
df = data.frame(A_w=rnorm(10), E_w=rnorm(10), A_e=rnorm(10), E_e=rnorm(10))
# stack the two A columns and the two E columns, keeping each (A, E) pair on the same row
df2 = data.frame(A=c(df$A_w, df$A_e), E=c(df$E_w, df$E_e))
Output (first 10 rows, i.e. the A_w/E_w pairs; the values are random, so yours will differ):
> head(df2, 10)
A E
1 1.25522468 -0.2441768
2 -0.50585191 -0.1383637
3 0.42374270 -0.9664189
4 -0.39858532 -0.3442157
5 -1.05665363 -1.3574362
6 0.79191788 -0.8202841
7 -1.31349592 0.7280619
8 -0.05609851 0.6365495
9 1.01068811 2.0222241
10 -1.15572972 -0.2190794
And for the plot: ggplot(df2, aes(x=A, y=E)) + geom_point()
There are ways to do this without having to join the columns by listing their names (with the tidyr package), but I think this solution is easier to understand from a beginner's point of view.
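A small extension of the "by hand" idea, as a sketch: adding a type column while stacking keeps track of which pair each row came from, so the two series can be coloured separately. The name stacked and the labels "w" and "e" are just illustrative; the plotting part assumes ggplot2 is installed.

```r
# Data from the question
data <- data.frame(
  A_w = c(0, 0.69, 1.41, 2.89, 6.42, 13.3, 25.5, 36.7, 44.3, 46.4),
  E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4, 4.4, 10.6, 16.5),
  A_e = c(0, 0.18, 0.37, 0.79, 1.93, 4.82, 11.4, 21.6, 31.1, 36.2),
  E_e = c(99.4, 99.3, 98.9, 98.4, 97.1, 93.3, 84.7, 71.5, 58.1, 48.7)
)

# Stack the (A_w, E_w) pairs on top of the (A_e, E_e) pairs,
# labelling each block so the pairs stay identifiable
stacked <- rbind(
  data.frame(A = data$A_w, E = data$E_w, type = "w"),
  data.frame(A = data$A_e, E = data$E_e, type = "e")
)

# Plot only if ggplot2 is available (assumption: it is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(stacked, aes(x = A, y = E, colour = type)) +
    geom_point() +
    geom_line()
  print(p)
}
```

This should give the same two-column-plus-label layout that the pivot_longer answer produces, just with the stacking spelled out.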

Getting p values for groupwise correlation using the dplyr package

I am trying run correlations between some variables in a dataframe. I have one character vector (group) and rest are numeric.
dataframe<-
Group V1 V2 V3 V4 V5
NG -4.5 3.5 2.4 -0.5 5.5
NG -5.4 5.5 5.5 1.0 2.0
GL 2.0 1.5 -3.5 2.0 -5.5
GL 3.5 6.5 -2.5 1.5 -2.5
GL 4.5 1.5 -6.5 1.0 -2.0
Following is my code:
library(dplyr)
dataframe %>%
group_by(Group) %>%
summarize(COR=cor(V3,V4))
Here is my output:
Group COR
<chr> <dbl>
1 GL 0.1848529
2 NG 0.1559912
How do I edit this code to also get the p-values? Any help would be appreciated! I have looked elsewhere but nothing is working. Thanks!!
If you want to see pairwise correlations, you could try corrplot (see ?corrplot):
library(corrplot)
df_cor <- cor(df[,sapply(df, is.numeric)])
corrplot(df_cor, method="color", type="upper", order="hclust")
In the resulting plot, positive correlations are shown in blue and negative correlations in red, with colour intensity proportional to the correlation coefficient.
#sample data
> dput(df)
structure(list(Group = structure(c(2L, 2L, 1L, 1L, 1L), .Label = c("GL",
"NG"), class = "factor"), V1 = c(-4.5, -5.4, 2, 3.5, 4.5), V2 = c(3.5,
5.5, 1.5, 6.5, 1.5), V3 = c(2.4, 5.5, -3.5, -2.5, -6.5), V4 = c(-0.5,
1, 2, 1.5, 1), V5 = c(5.5, 2, -5.5, -2.5, -2)), .Names = c("Group",
"V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,
-5L))
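For the p-values themselves, one possible sketch is to run cor.test within each group and pull out the estimate and p.value. The data below is hypothetical (dat is made up for illustration), because cor.test needs at least three complete pairs per group and the sample above has only two NG rows.

```r
# Hypothetical data: 5 rows per group so that cor.test() has enough pairs
dat <- data.frame(
  Group = rep(c("NG", "GL"), each = 5),
  V3 = c(2.4, 5.5, 1.0, 3.2, 4.1, -3.5, -2.5, -6.5, -1.0, -4.2),
  V4 = c(-0.5, 1.0, 0.3, 0.8, 1.5, 2.0, 1.5, 1.0, 2.2, 0.7)
)

# Split by group, run a Pearson correlation test in each piece,
# and collect the coefficient and its p-value into one data frame
res <- do.call(rbind, lapply(split(dat, dat$Group), function(d) {
  ct <- cor.test(d$V3, d$V4)
  data.frame(Group = d$Group[1],
             COR = unname(ct$estimate),
             P = ct$p.value)
}))
res
```

The same idea should carry over to the dplyr pipeline from the question by adding a second summary column, e.g. summarize(COR = cor(V3, V4), P = cor.test(V3, V4)$p.value).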

non-numeric argument running Bootstrap in R

I am trying to bootstrap my data; a sample of it is below:
AveOn AveOff AveLd DWELL_SEC
0.3 0.1 5.9 14
0.3 0.1 5.9 17
0.3 0.1 5.9 9
1.1 1.5 25.3 21
1.1 1.5 25.3 159
1.1 1.5 25.3 14
1.1 1.5 25.3 13
1.1 1.5 25.3 18
1.1 1.5 25.3 26
1.1 1.5 25.3 19
1.1 1.5 25.3 17
1.1 1.5 25.3 24
1.1 1.5 25.3 27
I wrote the following code
library(xlsx)
library(bootstrap)
rawData <- read.xlsx("9660.xlsx")
load<-function(AveLd,AveOff,AveOn,DWELL_SEC)sum((AveLd-AveOff)+AveOn)
bootstrap(rawData,load,10000,replace=true)
I kept getting this error:
Error in n * nboot : non-numeric argument to binary operator
Is there a way to solve it?
I appreciate your time and help.
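You are mixing up the arguments: the second positional argument of bootstrap() is nboot, so your function load is being used as the number of replicates, hence the "non-numeric argument" error. Name the arguments and pass the statistic as theta (func is a separate, optional argument); note also that the logical constant is TRUE, not true.
bootstrap(x, nboot=10000, theta=load)
Here x must be a vector that theta can work on, not a data frame.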
For further information have a look at the function help
?bootstrap
This is your data:
rawData = structure(list(AveOn = c(0.3, 0.3, 0.3, 1.1, 1.1, 1.1, 1.1, 1.1,
1.1, 1.1, 1.1, 1.1, 1.1), AveOff = c(0.1, 0.1, 0.1, 1.5, 1.5,
1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5), AveLd = c(5.9, 5.9,
5.9, 25.3, 25.3, 25.3, 25.3, 25.3, 25.3, 25.3, 25.3, 25.3, 25.3
), DWELL_SEC = c(14L, 17L, 9L, 21L, 159L, 14L, 13L, 18L, 26L,
19L, 17L, 24L, 27L)), class = "data.frame", row.names = c(NA,
-13L))
If you want to use the package bootstrap, then you bootstrap the indices of the data frame, and provide the data frame as an additional argument:
func = function(x, data) {
  # x holds the resampled row indices
  with(data[x, ], sum((AveLd - AveOff) + AveOn))
}
bootstrap(1:nrow(rawData), nboot = 1000, theta = func, data = rawData)
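If you would rather not depend on the bootstrap package, the same row-index resampling can be sketched in base R with sample and replicate. This is only an illustration: stat, boot_stats and the seed are made-up names and choices, and the statistic is the one from the question.

```r
# Data from the answer above
rawData <- data.frame(
  AveOn = c(rep(0.3, 3), rep(1.1, 10)),
  AveOff = c(rep(0.1, 3), rep(1.5, 10)),
  AveLd = c(rep(5.9, 3), rep(25.3, 10)),
  DWELL_SEC = c(14, 17, 9, 21, 159, 14, 13, 18, 26, 19, 17, 24, 27)
)

# Statistic from the question, computed on a data frame of rows
stat <- function(d) sum((d$AveLd - d$AveOff) + d$AveOn)

set.seed(1)   # arbitrary seed, only for reproducibility
nboot <- 1000

# Each replicate resamples the rows with replacement and recomputes the statistic
boot_stats <- replicate(nboot, {
  idx <- sample(nrow(rawData), replace = TRUE)
  stat(rawData[idx, ])
})

# Observed value, bootstrap standard error and a simple percentile interval
c(estimate = stat(rawData), se = sd(boot_stats))
quantile(boot_stats, c(0.025, 0.975))
```

Resampling whole rows keeps AveOn, AveOff and AveLd from the same observation together, which is the point of bootstrapping indices rather than individual columns.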
library(xlsx)
library(bootstrap)
rawData <- read.xlsx("C:\\Users\\TAQWA\\Downloads\\9660.xlsx",1)
#load<-function(AveLd,AveOff,AveOn,DWELL_SEC)
# + + sum((AveLd-AveOff)+AveOn)
#bootstrap(rawData,10000,load())
# One resampled copy of the data per slice of the array
three_d_array <- array(0, dim = c(270, 6, 20))
for (i in 1:20) {
  candy = 1:nrow(rawData)
  B = sample(candy, nrow(rawData), replace = TRUE)  # resample row indices
  a = rawData[B, ]
  three_d_array[, , i] = as.matrix(a)
}

R: box plot with 2 or more series

My data frame is simple (and probably is not strictly a dataframe):
date MAE_f0 MAE_f1
1 20140101 0.2 0.2
2 20140102 1.9 0.1
3 20140103 0.1 0.3
4 20140104 7.8 15.9
5 20140105 1.9 4.6
6 20140106 0.8 0.8
7 20140107 0.5 0.6
8 20140108 0.2 0.2
9 20140109 0.2 0.2
10 20140110 0.8 1.1
11 20140111 0.2 0.2
12 20140112 0.4 0.4
13 20140113 2.8 0.9
14 20140114 5.4 5.8
15 20140115 0.2 0.3
16 20140116 4.9 3.1
17 20140117 3.7 6.0
18 20140118 1.4 2.1
19 20140119 0.9 3.0
20 20140120 0.2 3.6
21 20140121 0.3 0.3
22 20140122 0.4 0.4
23 20140123 0.6 1.7
24 20140124 6.1 4.7
25 20140125 0.1 0.0
26 20140126 7.4 4.9
27 20140127 0.8 0.9
28 20140128 0.3 0.3
29 20140129 3.0 4.2
30 20140130 9.9 17.3
For every day I have two variables: MAE for f0 and MAE for f1.
I can compute the frequency classes of both variables over the whole period using cut with the same intervals for both:
cut(mae.df$MAE_f0,c(0,2,5,10,50))
cut(mae.df$MAE_f1,c(0,2,5,10,50))
Now I can use boxplot to plot each variable against its frequency classes:
boxplot(mae.df$MAE_f0~cut(mae.df$MAE_f0,c(0,2,5,10,50)))
boxplot(mae.df$MAE_f1~cut(mae.df$MAE_f1,c(0,2,5,10,50)))
The two resulting box plots are very simple (I can't show them because I don't have enough reputation): the x-axis has the intervals (0-2, 2-5, 5-10, 10-50) and the y-axis has the box plot of MAE_f0 (or MAE_f1) for each interval.
My question is probably trivial: I'd like a single plot containing both variables, MAE_f0 and MAE_f1, with two box plots for each interval (that is, 2 for 0-2, 2 for 2-5, and so on).
I know that my knowledge of R and data frames is very poor, and I'm surely missing something important, especially about data frames and reshaping. Sorry in advance for that! I've seen some nice examples on Stack Overflow about grouped box plots, all without a time variable, but I can't figure out how to adjust my data frame to do the same.
I hope my question is not misplaced; sorry again if it is.
Umbe
Here is how I would do this. I think it makes sense to melt your data first. A quick tutorial on melting your data is available here.
# First, make this reproducible by using dput for the data frame
df <- structure(list(date = 20140101:20140130, MAE_f0 = c(0.2, 1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4, 0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8, 0.3, 3, 9.9), MAE_f1 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2, 0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3, 0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)), .Names = c("date", "MAE_f0", "MAE_f1"), row.names = c(NA, -30L), class = "data.frame")
require(ggplot2)
require(reshape2)
# Melt the original data frame
df2 <- melt(df, measure.vars = c("MAE_f0", "MAE_f1"))
head(df2)
# date variable value
# 1 20140101 MAE_f0 0.2
# 2 20140102 MAE_f0 1.9
# 3 20140103 MAE_f0 0.1
# 4 20140104 MAE_f0 7.8
# 5 20140105 MAE_f0 1.9
# 6 20140106 MAE_f0 0.8
# Create a "cuts" variable with the correct breaks
df2$cuts <- cut(df2$value,
                breaks = c(-Inf, 2, 5, 10, +Inf),
                labels = c("first cut", "second cut", "third cut", "fourth cut"))
head(df2)
# date variable value cuts
# 1 20140101 MAE_f0 0.2 first cut
# 2 20140102 MAE_f0 1.9 first cut
# 3 20140103 MAE_f0 0.1 first cut
# 4 20140104 MAE_f0 7.8 third cut
# 5 20140105 MAE_f0 1.9 first cut
# 6 20140106 MAE_f0 0.8 first cut
# Plotting
ggplot(df2, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ cuts, nrow = 1)
Result:
Here is one way. First reshape your data to long format. Then add a fake data point: there is no MAE_f0 value in the (10,50] interval, so without it that interval would contain only one box. Combine the reshaped data with the fake data, and when you draw the figure use coord_cartesian with the range of y values in the original data set, so the fake point (value 60) stays out of view. Hopefully this gives you the graphic you want. Here, your data is called mydf.
library(dplyr)
library(tidyr)
library(ggplot2)
mydf <- structure(list(V1 = 1:30, V2 = 20140101:20140130, V3 = c(0.2,
1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4,
0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8,
0.3, 3, 9.9), V4 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2,
0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3,
0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)), .Names = c("V1",
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -30L
))
ana <- select(mydf, -V1) %>%
  rename(date = V2, MAE_f0 = V3, MAE_f1 = V4) %>%
  gather(variable, value, -date) %>%
  mutate(frequency = cut(value, breaks = c(-Inf, 2, 5, 10, 50)))
# Create a fake df
extra <- data.frame(date = 20140101,
                    variable = "MAE_f0",
                    value = 60,
                    frequency = "(10,50]")
new <- rbind(ana, extra)
ggplot(data = new, aes(x = frequency, y = value, fill = variable)) +
  geom_boxplot(position = "dodge") +
  coord_cartesian(ylim = range(ana$value) + c(-0.25, 0.25))
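To check which variable/interval combinations are actually empty before deciding whether a fake point is needed, a quick count table helps. The sketch below uses base R only; long and counts are illustrative names, and the base boxplot call at the end is just one possible way to get two boxes per interval without ggplot2.

```r
# Data from the question
df <- data.frame(
  date = 20140101:20140130,
  MAE_f0 = c(0.2, 1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8,
             5.4, 0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4,
             0.8, 0.3, 3, 9.9),
  MAE_f1 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2, 0.2, 1.1, 0.2, 0.4, 0.9,
             5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3, 0.4, 1.7, 4.7, 0, 4.9, 0.9,
             0.3, 4.2, 17.3)
)

# Long format by hand: one row per date x variable
long <- data.frame(
  variable = rep(c("MAE_f0", "MAE_f1"), each = nrow(df)),
  value = c(df$MAE_f0, df$MAE_f1)
)
long$frequency <- cut(long$value, breaks = c(-Inf, 2, 5, 10, 50))

# Number of observations per variable and interval;
# a zero marks a combination that has no box to draw
counts <- table(long$variable, long$frequency)
counts

# Base R grouped box plot: one box per variable within each interval
boxplot(value ~ variable + frequency, data = long,
        col = c("tomato", "skyblue"), las = 2, xlab = "", ylab = "MAE")
```

In this data the table shows no MAE_f0 values in (10,50], which is the gap the fake data point above is meant to fill.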
