R - ordering in boxplot - r

I am trying to produce a series of box plots in R that is grouped by 2 factors. I've managed to make the plot, but I cannot get the boxes to order in the correct direction.
The data frame I am using looks like this:
Nitrogen Species Treatment
2 G L
3 R M
4 G H
4 B L
2 B M
1 G H
I tried:
boxplot(mydata$Nitrogen~mydata$Species*mydata$Treatment)
This ordered the boxes alphabetically (the first three were the "High" treatments, and within those three they were ordered alphabetically by species name).
I want the box plot ordered Low>Medium>High, and then within each of those groups G>R>B for the species.
So I tried using a factor in the formula:
f = ordered(interaction(mydata$Treatment, mydata$Species),
levels = c("L.G","L.R","L.B","M.G","M.R","M.B","H.G","H.R","H.B"))
then:
boxplot(mydata$Nitrogen~f)
However, the boxes are still showing up in the same order. The labels are now different, but the boxes have not moved.
I have pulled out each set of data and plotted them all together individually:
lg = mydata[mydata$Treatment=="L" & mydata$Species=="G", "Nitrogen"]
mg = mydata[mydata$Treatment=="M" & mydata$Species=="G", "Nitrogen"]
hg = mydata[mydata$Treatment=="H" & mydata$Species=="G", "Nitrogen"]
etc ..
boxplot(lg, lr, lb, mg, mr, mb, hg, hr, hb)
This gives what I want, but I would prefer a more elegant way, so that I don't have to pull each subset out individually for larger data sets.
Loadable data:
mydata <-
structure(list(Nitrogen = c(2L, 3L, 4L, 4L, 2L, 1L), Species = structure(c(2L,
3L, 2L, 1L, 1L, 2L), .Label = c("B", "G", "R"), class = "factor"),
Treatment = structure(c(2L, 3L, 1L, 2L, 3L, 1L), .Label = c("H",
"L", "M"), class = "factor")), .Names = c("Nitrogen", "Species",
"Treatment"), class = "data.frame", row.names = c(NA, -6L))

The following commands will create the ordering you need by rebuilding the Treatment and Species factors, with explicit manual ordering of the levels:
mydata$Treatment = factor(mydata$Treatment,c("L","M","H"))
mydata$Species = factor(mydata$Species,c("G","R","B"))
edit 1 : oops I had set it to HML instead of LMH. fixing.
edit 2 : what factor(X,Y) does:
If you run factor(X,Y) on an existing factor, it uses the ordering of the values in Y to enumerate the values present in the factor X. Here are some examples with your data.
> mydata$Treatment
[1] L M H L M H
Levels: H L M
> as.integer(mydata$Treatment)
[1] 2 3 1 2 3 1
> factor(mydata$Treatment,c("L","M","H"))
[1] L M H L M H <-- not changed
Levels: L M H <-- changed
> as.integer(factor(mydata$Treatment,c("L","M","H")))
[1] 1 2 3 1 2 3 <-- changed
It does NOT change what the factor looks like at first glance, but it does change how the data is stored.
What's important here is that many plot functions will plot the lowest enumeration leftmost, followed by the next, etc.
If you create factors simply using factor(X), then the enumeration is usually based on the alphabetical order of the factor levels (e.g. "H","L","M"). If your labels have a conventional ordering different from alphabetical (e.g. "H","M","L"), this can make your graphs seem strange.
At first glance, it may seem like the problem is due to the ordering of data in the data frame - i.e. if only we could place all "H" at the top and "L" at the bottom, then it would work. It doesn't. But if you want your labels to appear in the same order as the first occurrence in the data, you can use this form:
mydata$Treatment = factor(mydata$Treatment, unique(mydata$Treatment))
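To check the box order without drawing anything, you can look at the levels of the interaction that the formula builds. The following is a minimal sketch on the six rows from the question, rebuilt here with character columns; the name box_order is only for illustration.

```r
# The question's six rows, rebuilt with plain character columns
mydata <- data.frame(
  Nitrogen  = c(2, 3, 4, 4, 2, 1),
  Species   = c("G", "R", "G", "B", "B", "G"),
  Treatment = c("L", "M", "H", "L", "M", "H")
)

# Set the level order explicitly instead of relying on alphabetical order
mydata$Treatment <- factor(mydata$Treatment, levels = c("L", "M", "H"))
mydata$Species   <- factor(mydata$Species,   levels = c("G", "R", "B"))

# boxplot() groups by the interaction of the two factors;
# the first factor (Species) varies fastest within each Treatment
box_order <- levels(interaction(mydata$Species, mydata$Treatment))
print(box_order)

# Same grouping, drawn (combinations with no rows are left empty)
boxplot(Nitrogen ~ Species * Treatment, data = mydata)
```

The printed levels read left to right in the same order as the boxes, so a wrong order shows up here before any plot is produced.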

This earlier StackOverflow question shows how to reorder a boxplot based on a numerical value; what you need here is probably just a switch from factor to the related type ordered. But it is hard to say, as we do not have your data and you didn't provide a reproducible example.
Edit Using the dataset you posted in variable md and relying on the solution I pointed to earlier, we get
R> md$Species <- ordered(md$Species, levels=c("G", "R", "B"))
R> md$Treatment <- ordered(md$Treatment, levels=c("L", "M", "H"))
R> with(md, boxplot(Nitrogen ~ Species * Treatment))
which creates the chart you were looking to create.
This is also equivalent to the other solution presented here.

Related

Plotting Arbitrary Functions by Group in R

I have a dataset (test_df) that looks like:
Species TreatmentA TreatmentB X0 L  K
Apple   Hot        Cloudy     1  2  3
Apple   Cold       Cloudy     4  5  6
Orange  Hot        Sunny      7  8  9
Orange  Cold       Sunny      10 11 12
I would like to display the effect of the treatments by using the X0, L, and K values as coefficients in a standard logistic function and plotting the same species across the various treatments on the same plot. I would like a grid of plots, with the logistic curves for each species on its own plot and the treatments distinguished by color within every plot. In the above example, Plot1.Grid1 would have 2 logistic curves corresponding to Apple Hot and Apple Cold, and Plot1.Grid2 would have 2 logistic curves corresponding to Orange Hot and Orange Cold.
The below code will create a single logistic function curve which can then be layered, but manually adding the layers for multiple treatments is tedious.
testx0 <- 1
testL <- 2
testk <- 3
days <- seq(from = -5, to = 5, by = 1)
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
ggplot()+aes(x = days, y = functionmultitest(days,testL,testk,testx0))+geom_line()
The method described in (https://statisticsglobe.com/draw-multiple-function-curves-to-same-plot-in-r) works for data frames with few species or treatments, but it becomes very tedious to define the curves individually if you have many treatments/species. Is there a way to programmatically pass the list of coefficients and have ggplot handle the grouping?
Thank you!
Your current code shows how to compute the curve for a single row in your data frame. What you can do is pre-compute the curve for each row and then feed to ggplot.
Setup:
# Packages
library(ggplot2)
# Your days vector
days <- seq(from = -5, to = 5, by = 1)
# Your sample data frame above
df = structure(list(Species = c("Apple", "Apple", "Orange", "Orange"
), TreatmentA = c("Hot", "Cold", "Hot", "Cold"), TreatmentB = c("Cloudy",
"Cloudy", "Sunny", "Sunny"), X0 = c(1L, 4L, 7L, 10L), L = c(2L,
5L, 8L, 11L), K = c(3L, 6L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-4L))
# Your function
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
We'll "expand" each row of your data frame with the days vector:
# Define first a data frame of days:
days_df = data.frame(days = days)
# Perform a cross join
df_all = merge(days_df, df, all = T)
At this point, you will have a data frame where each original row is duplicated for as many days you have.
Now, just as you did for one row, we'll compute the value of the function for each row and store in the df_all as result:
df_all$result = mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)
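As a side note, functionmultitest uses only vectorised arithmetic and exp(), so it can probably be called on whole columns at once instead of going through mapply(). Here is a minimal sketch, assuming the sample data above; res_map and res_vec are illustrative names, and by = NULL is just an explicit way to ask merge() for the cross join.

```r
days <- seq(from = -5, to = 5, by = 1)

# Sample coefficients from the question
df <- data.frame(
  Species    = c("Apple", "Apple", "Orange", "Orange"),
  TreatmentA = c("Hot", "Cold", "Hot", "Cold"),
  TreatmentB = c("Cloudy", "Cloudy", "Sunny", "Sunny"),
  X0 = c(1, 4, 7, 10),
  L  = c(2, 5, 8, 11),
  K  = c(3, 6, 9, 12)
)

functionmultitest <- function(x, testL, testK, testX0) {
  testL / (1 + exp(-testK * (x - testX0)))
}

# Cross join: every day paired with every row of coefficients
df_all <- merge(data.frame(days = days), df, by = NULL)

# Row-by-row evaluation, as in the answer
res_map <- mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)

# Vectorised evaluation: one call on whole columns
res_vec <- functionmultitest(df_all$days, df_all$L, df_all$K, df_all$X0)

# Both approaches should agree
print(all.equal(res_map, res_vec))
```

The vectorised call only works because every operation inside the function is element-wise; a function with if/else on scalar inputs would still need mapply().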
I'm not sure how you intended to handle treatmentA and treatmentB, so I'll just combine for illustration purposes:
df_all$combined_treatment = paste0(df_all$TreatmentA, "-", df_all$TreatmentB)
We can now feed this data frame to ggplot, set the color to be combined_treatment, and use the facet_grid function to split by species
ggplot(data = df_all, aes(x = days, y = result, color = combined_treatment))+
geom_line() +
facet_grid(Species ~ ., scales = "free")
The result is as follows:

Data manipulations in R

As part of a project, I am using R to analyze some data. I am stuck on deriving a few values from an existing dataset that I imported from a CSV file.
The file looks like:
For my analysis, I want to create another column holding the difference between the current value of x and its previous value. For the first row of every unique i, the new value should simply be the current x. I am new to R and have tried various approaches for some time without success. I would appreciate suggestions on an approach I can follow to achieve this.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=TRUE), i=rep(1:2, each=5))
You can use the diff() function. Note that diff() returns a vector one element shorter than its input, so to add the result as a new column of your existing data frame you need to pad it. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That should input an NA value as the first entry in your new column and the remaining values will be the difference between sequential values in your "x" column
UPDATE:
You can create a simple loop by subsetting through every unique instance of "i" and then calculating the difference between your x values
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
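for(i in 1:length(values)){
# rows belonging to the current value of i (note == for comparison)
data1 = MyData[MyData$i == values[i],]
data1$newX = c(NA,diff(data1$x))
# append to the accumulating data frame
newdf = rbind(newdf,data1)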
}
# and then if you want to overwrite newdf to your original dataframe
MyData = newdf
# remove some variables
rm(data1,newdf,values)
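If you would rather stay in base R and avoid the explicit loop, ave() can apply the same per-group calculation in one call. This is only a sketch: it reuses the x values from the question, but the two-group i column is made up here so that the grouping has something to do (in the posted data every i is 1).

```r
# x values from the question; the two-group `i` is a hypothetical stand-in
MyData <- data.frame(
  t = 1:10,
  x = c(34450L, 34469L, 34470L, 34483L, 34488L,
        34512L, 34530L, 34553L, 34575L, 34589L),
  i = rep(1:2, each = 5)
)

# Within each i: keep the first x as-is, then successive differences
MyData$x_diff <- ave(MyData$x, MyData$i,
                     FUN = function(v) c(v[1], diff(v)))

print(MyData)
```

Replacing v[1] with NA inside the function gives the NA-padded variant shown above instead.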

How do I plot boxplots of two different series?

I have 2 dataframe sharing the same rows IDs but with different columns
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where I have in x axis the chrom and coord combined (so 3 points) and for each x value 2 boxplots side by side corresponding to the two dataframes ?
What is the best way of doing this ? Should I merge the two dataframes together somehow in order to get only one and loop over the boxplots rendering by 3 columns ?
Any idea on how this can be done ?
The problem is that the two dataframes have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the dataframe in order to get the same number of column but got lost on how to this properly
Thanks in advance
UPDATE
This is what I tried to do
I merged chrom and coord columns together to create a single ID
I used reshape to melt the dataframes
I merged the 2 melted dataframe into a single one
the head looks like this
I have two variable A2 and A4 corresponding to the 2 dataframes
then I created a boxplot such using this
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what i'd do
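library(reshape2)  # provides melt(); assumed here, the package is not named above
library(ggplot2)   # needed for the plot further down
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
# tag each table with its source, then stack them
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
# chrom:coord labels as a factor, with levels sorted by chrom and then coord
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))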
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like

Calculating subtotals (sum, stdev, average etc)

I have been searching for this for a while, but haven't been able to find a clear answer so far. Probably have been looking for the wrong terms, but maybe somebody here can quickly help me. The question is kind of basic.
Sample data set:
set <- structure(list(VarName = structure(c(1L, 5L, 4L, 2L, 3L),
.Label = c("Apple/Blue/Nice",
"Apple/Blue/Ugly", "Apple/Pink/Ugly", "Kiwi/Blue/Ugly", "Pear/Blue/Ugly"
), class = "factor"), Color = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("Blue",
"Pink"), class = "factor"), Qty = c(45L, 34L, 46L, 21L, 38L)), .Names = c("VarName",
"Color", "Qty"), class = "data.frame", row.names = c(NA, -5L))
This gives a data set like:
set
VarName Color Qty
1 Apple/Blue/Nice Blue 45
2 Pear/Blue/Ugly Blue 34
3 Kiwi/Blue/Ugly Blue 46
4 Apple/Blue/Ugly Blue 21
5 Apple/Pink/Ugly Pink 38
What I would like to do is fairly straight forward. I would like to sum (or averages or stdev) the Qty column. But, also I would like to do the same operation under the following conditions:
VarName includes "Apple"
VarName includes "Ugly"
Color equals "Blue"
Anybody that can give me a quick introduction on how to perform this kind of calculations?
I am aware that some of it can be done by the aggregate() function, e.g.:
aggregate(set[3], FUN=sum, by=set[2])[1,2]
However, I believe that there is a more straightforward way of doing this than this. Are there some filters that can be added to functions like sum()?
The easiest way is to split up your VarName column; then subsetting becomes very easy. So, let's create an object where VarName has been separated:
##There must(?) be a better way than this. Anyone?
new_set = t(as.data.frame(sapply(as.character(set$VarName), strsplit, "/")))
Brief explanation:
We use as.character because set$VarName is a factor
sapply takes each value in turn and applies strsplit
The strsplit function splits up the elements
We convert to a data frame
Transpose to get the correct rotation
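On the "better way" question in the comment above: one possible alternative is to let strsplit() work on the whole vector and bind the pieces row-wise. This is a sketch rather than a definitive replacement; parts and new_set_alt are illustrative names, and it assumes every VarName has exactly three "/"-separated fields.

```r
# Sample data from the question (only the columns needed here)
set <- data.frame(
  VarName = c("Apple/Blue/Nice", "Pear/Blue/Ugly", "Kiwi/Blue/Ugly",
              "Apple/Blue/Ugly", "Apple/Pink/Ugly"),
  Qty = c(45L, 34L, 46L, 21L, 38L)
)

# strsplit() returns one character vector per row; rbind them into a matrix
parts <- do.call(rbind, strsplit(as.character(set$VarName), "/", fixed = TRUE))

# Matrix -> data frame with default column names V1, V2, V3
new_set_alt <- as.data.frame(parts)
new_set_alt$Qty <- set$Qty

print(new_set_alt)
```

If some names had a different number of fields, rbind would recycle values with a warning, so it is worth checking the field counts first with lengths() on the strsplit() result.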
Next,
##Convert to a data frame
new_set = as.data.frame(new_set)
##Make nice rownames - not actually needed
rownames(new_set) = 1:nrow(new_set)
##Add in the Qty column
new_set$Qty = set$Qty
This gives
R> new_set
V1 V2 V3 Qty
1 Apple Blue Nice 45
2 Pear Blue Ugly 34
3 Kiwi Blue Ugly 46
4 Apple Blue Ugly 21
5 Apple Pink Ugly 38
Now all the operations are as standard. For example,
##Add up all blue Qtys
sum(new_set[new_set$V2 == "Blue",]$Qty)
[1] 146
##Average of Blue and Ugly Qtys
mean(new_set[new_set$V2 == "Blue" & new_set$V3 == "Ugly",]$Qty)
[1] 33.66667
Once it's in the correct form, you can use ddply, which does everything you want (and more)
library(plyr)
##Split the data frame up by V1 and take the mean of Qty
ddply(new_set, .(V1), summarise, m = mean(Qty))
##Split the data frame up by V1 & V2 and take the mean of Qty
ddply(new_set, .(V1, V2), summarise, m = mean(Qty))
Is this what you're looking for?
# sum for those including 'Apple'
apple <- set[grep('Apple', set[, 'VarName']), ]
aggregate(apple[3], FUN=sum, by=apple[2])
Color Qty
1 Blue 66
2 Pink 38
# sum for those including 'Ugly'
ugly <- set[grep('Ugly', set[, 'VarName']), ]
aggregate(ugly[3], FUN=sum, by=ugly[2])
Color Qty
1 Blue 101
2 Pink 38
# sum for Color==Blue
sum(set[set[, 'Color']=='Blue', 3])
[1] 146
The last sum could be done by using subset
sum(subset(set, Color=='Blue')[,3])
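On the question of "filters" for sum(): a logical vector used as an index plays that role in base R. The sketch below builds one filter per condition from the question with grepl() and ==; the names is_apple, is_ugly, is_blue and totals are only illustrative.

```r
# Sample data from the question
set <- data.frame(
  VarName = c("Apple/Blue/Nice", "Pear/Blue/Ugly", "Kiwi/Blue/Ugly",
              "Apple/Blue/Ugly", "Apple/Pink/Ugly"),
  Color = c("Blue", "Blue", "Blue", "Blue", "Pink"),
  Qty = c(45L, 34L, 46L, 21L, 38L)
)

# One logical filter per condition
is_apple <- grepl("Apple", as.character(set$VarName), fixed = TRUE)
is_ugly  <- grepl("Ugly",  as.character(set$VarName), fixed = TRUE)
is_blue  <- set$Color == "Blue"

# Any summary function can be applied to the filtered column
sum(set$Qty[is_apple])
mean(set$Qty[is_ugly])
sd(set$Qty[is_blue])

# All three subtotals in one go
totals <- sapply(list(apple = is_apple, ugly = is_ugly, blue = is_blue),
                 function(keep) sum(set$Qty[keep]))
print(totals)
```

Filters can also be combined with & and |, for example set$Qty[is_apple & is_ugly].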

Collapse columns by grouping variable (in base)

I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per run of the grouping factor. So as long as the group column says m, I want to group the text together, and so on. I provided a sample data set showing before and after. I am writing this for a package and have thus far avoided all reliance on other packages except for wordcloud, and would like to keep it this way.
I suspect rle may be useful with cumsum but haven't been able to figure this one out.
Thank you in advance.
What the data looks like
text group
1 Computer is fun. Not too fun. m
2 No its not, its dumb. m
3 How can we be certain? f
4 There is no way. m
5 I distrust you. m
6 What are you talking about? f
7 Shall we move on? Good then. f
8 Im hungry. Lets eat. You already? m
What I'd like the data to look like
text group
1 Computer is fun. Not too fun. No its not, its dumb. m
2 How can we be certain? f
3 There is no way. I distrust you. m
4 What are you talking about? Shall we move on? Good then. f
5 Im hungry. Lets eat. You already? m
The Data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"Im hungry. Lets eat. You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
EDIT: I found I can add unique column for each run of the group variable with:
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Yielding:
text group new
1 Computer is fun. Not too fun. m 1
2 No its not, its dumb. m 1
3 How can we be certain? f 2
4 There is no way. m 3
5 I distrust you. m 3
6 What are you talking about? f 4
7 Shall we move on? Good then. f 4
8 Im hungry. Lets eat. You already? m 5
This makes use of rle to create an id to group the sentences on. It uses tapply along with paste to bring the output together
## Your example data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on?  Good then.",
"Im hungry.  Lets eat.  You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
# Needed for later
k <- rle(as.numeric(dat$group))
# Create a grouping vector
id <- rep(seq_along(k$len), k$len)
# Combine the text in the desired manner
out <- tapply(dat$text, id, paste, collapse = " ")
# Bring it together into a data frame
answer <- data.frame(text = out, group = levels(dat$group)[k$val])
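The question mentions cumsum, and the run id can also be built that way without rle: start a new id whenever the group differs from the previous row. This is a minimal sketch assuming the same data, rebuilt here with character columns; g, id and answer_alt are illustrative names.

```r
dat <- data.frame(
  text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
           "How can we be certain?", "There is no way.", "I distrust you.",
           "What are you talking about?", "Shall we move on? Good then.",
           "Im hungry. Lets eat. You already?"),
  group = c("m", "m", "f", "m", "m", "f", "f", "m"),
  stringsAsFactors = FALSE
)

g <- as.character(dat$group)

# TRUE at the first row and wherever the group changes; cumsum numbers the runs
id <- cumsum(c(TRUE, g[-1] != g[-length(g)]))

# Paste the text within each run
out <- tapply(dat$text, id, paste, collapse = " ")

# One row per run, labelled with the group of its first row
answer_alt <- data.frame(text = as.vector(out), group = g[!duplicated(id)],
                         stringsAsFactors = FALSE)
print(answer_alt)
```

It should give the same grouping as the rle version; which one reads better is mostly a matter of taste.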
I got the answer and came back to post but Dason beat me to it and more understandably than my own.
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Paste <- function(x) paste(x, collapse=" ")
aggregate(text~new, dat, Paste)
EDIT
How I'd do it with aggregate and what I learned from your response (though tapply is a better solution):
y <- rle(as.character(dat$group))
x <- y[[1]]
dat$new <- as.factor(rep(1:length(x), x))
text <- aggregate(text~new, dat, paste, collapse = " ")[, 2]
data.frame(text, group = y[[2]])
