interpreting R native boxplot function - r

I'm very new to R. I'm trying out the native boxplot function, where ~ is supposed to combine different variables on the X axis.
My book gives two examples
boxplot(len ~ supp, data = ToothGrowth)
and
boxplot(len ~ supp + dose, data = ToothGrowth)
I do understand the first one, but what does + in boxplot(len ~ supp + dose, data = ToothGrowth) do? The output is confusing for me (shown below).

In the second instance, len ~ supp + dose is equivalent to:
TG_split <- with(
  ToothGrowth,
  split(len, list(supp, dose))
)
boxplot(TG_split)
i.e. it splits the len vector by the two factors supp and dose, and gives you the values of len for every combination of the two factors.
TG_split
$OJ.0.5
[1] 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7
$VC.0.5
[1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0
$OJ.1
[1] 19.7 23.3 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3
$VC.1
[1] 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5
$OJ.2
[1] 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0
$VC.2
[1] 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
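One way to check this equivalence is to compare the statistics that boxplot computes for both calls. This is only a minimal sketch: plot = FALSE is passed just to get the numbers back without drawing, and the names b_formula and b_split are made up for illustration.

```r
# ToothGrowth ships with base R (datasets package)
b_formula <- boxplot(len ~ supp + dose, data = ToothGrowth, plot = FALSE)

# The manual split from the answer above
TG_split <- with(ToothGrowth, split(len, list(supp, dose)))
b_split <- boxplot(TG_split, plot = FALSE)

# One group per supp/dose combination
print(b_formula$names)

# TRUE if both calls produce the same five-number summaries per box
print(all.equal(b_formula$stats, b_split$stats))
```

So the + on the right-hand side of the formula does not add anything numerically; it just lists a second grouping variable, and each box corresponds to one supp/dose pair.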

Related

Efficiently find groups of rows in a data frame with almost identical values

I have a data set of about a thousand rows, but it will grow.
I want to find groups of rows where a certain number of attributes are almost identical.
I can do a stupid brute-force search, but it's really slow and I'm sure R can do this much better.
Test data:
df = read.table(text="
Date Time F1 F2 F3 F4 F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header=T)
#Initialize column to hold clustering info
df$cluster = NA
#Maximum tolerance for a match between rows
toler=1.0
#Brute force search, very ugly.
for(i in 1:(nrow(df)-1)){
  if(is.na(df$cluster[i])){
    df$cluster[i] <- i
    for(j in (i+1):nrow(df)){
      if(max(abs(df[i,3:7] - df[j,3:7]))<toler){
        df$cluster[j]<-i
      }
    }
  }
}
if(is.na(df$cluster[j])){df$cluster[j] <- j}
df
Expected output:
Date Time F1 F2 F3 F4 F5 Conc cluster
1 2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1 1
2 2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4 2
3 2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0 1
4 2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8 4
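Here is an option using data.table (run the setup code under "data:" below first, since it defines cols, fcols, rng and toler):
# add lower/upper bound columns for F1..F5
df[, (cols) := c(.SD - toler, .SD + toler), .SDcols=fcols]
# build the join conditions: F1>lower1, ..., F5>lower5, F1<upper1, ..., F5<upper5
onstr <- c(paste0(fcols,">",cols[rng]),paste0(fcols,"<",cols[5+rng]))
# for each row, take the smallest row number among the rows within the bounds
df[, cluster := df[df, on=onstr, by=.EACHI, min(x.rn)]$V1]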
output:
Date Time F1 F2 F3 F4 F5 Conc rn lower1 lower2 lower3 lower4 lower5 upper1 upper2 upper3 upper4 upper5 cluster
1: 2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1 1 -0.9 16.6 13.2 18.5 17.6 1.1 18.6 15.2 20.5 19.6 1
2: 2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4 2 -1.0 27.7 12.7 14.6 15.2 1.0 29.7 14.7 16.6 17.2 2
3: 2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0 3 -1.0 16.0 13.0 18.0 17.0 1.0 18.0 15.0 20.0 19.0 1
4: 2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8 4 -1.0 24.5 11.6 13.9 15.6 1.0 26.5 13.6 15.9 17.6 4
data:
df = read.table(text="
Date Time F1 F2 F3 F4 F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header=T)
library(data.table)
setDT(df)[, rn := .I]
toler <- 1.0
rng <- 1L:5L
fcols <- paste0("F", rng)
cols <- do.call(paste0, CJ(c("lower", "upper"), rng))
Explanation:
The L suffix after a number tells R to store it as an integer (see Why would R use the "L" suffix to denote an integer?)
df[df, on=onstr, by=.EACHI, min(x.rn)] performs a non-equi self-join using the inequalities specified in onstr.
$V1 accesses the V1 column of the above join result (columns are named V* by default when no name is provided).
df[, cluster := my_result] updates the original data.table by reference, so it's faster when the dataset is large. The improvement comes from not having to make a deep copy of the original data.frame, as in base R.
Since we are performing a join, there is a left table and a right table. In data.table lingo, for x[i, on=join_keys], x is the right table and i is the left table (the inspiration comes from base R syntax). Hence, the x. prefix is used to access columns of the right table x, much like in SQL, and similarly i. for columns of i. More details can be found in the data.table vignettes (see https://cran.r-project.org/web/packages/data.table/).
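For a data set of this size, the cluster labels can be cross-checked in base R. The sketch below is not the data.table approach from the answer: it uses dist() with method = "maximum" (the largest absolute difference across F1..F5) and labels each row with the lowest row number lying within the tolerance, which is the same rule as taking min(x.rn) in the join. It builds the full pairwise distance matrix, so memory grows quadratically with the number of rows.

```r
df <- read.table(text = "
Date Time F1 F2 F3 F4 F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header = TRUE)

toler <- 1.0

# Pairwise Chebyshev (maximum) distance between rows over F1..F5
d <- as.matrix(dist(df[, paste0("F", 1:5)], method = "maximum"))

# For every row, the first (lowest-numbered) row within the tolerance;
# each row always matches itself because the diagonal is 0
df$cluster <- unname(apply(d < toler, 1, function(near) which(near)[1]))

print(df$cluster)
```

On the four test rows this gives the same labels as the expected output in the question (1, 2, 1, 4). Note that it is not guaranteed to match the brute-force loop on every data set, because the loop never uses an already-clustered row as a seed.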

R: reshape and dcast confusion

Having worked on this for a few hours and looked through all the dcast answers and tutorials I could find, I think I'd better ask instead.
Example data first:
xx = data.frame(SOIL = c(rep("kraz", 20), (rep("sodo", 20))),
DM = runif(0, 20,n=40),
cutnum = c(rep(1:4,10)))
Now, the required operation. I'd like to end up with a table with Soil names on the row, and Cutnum as column names, with the DM values in the columns under the cutnumbers.
# Soil 1 2 3 4
# kraz 1.2 19 12.1 9.9
# kraz 15.3 4.5 9.2 12.1
# kraz 14 15.2 5.2 15.4
# kraz 18.5 0.7 14.3 5
# kraz 17.1 15.8 2.9 9.5
# kraz 13 14.4 4.9 8.6
# kraz 3 10.2 3.5 14
# kraz 17.7 8.6 10.6 16.1
# kraz 12.6 1.7 2.2 17.5
# kraz 3.8 16.7 4.8 0.4
# kraz 4.1 17.1 12.5 14.5
# kraz 17.8 5.2 11.2 9.5
# kraz 12.3 2.2 4.8 8.7
# kraz 7.3 3 10.2 1.6
# kraz 11.3 12.2 13.4 10.2
# kraz 7.5 15.9 8.9 18.3
# kraz 15 5 19.6 16.5
# sodo 8.4 2.6 18.3 15.1
# sodo 6.9 19.7 6.5 8.4
# sodo 4 6.5 4.2 11.9
# sodo 0.8 12 18.3 15.4
# sodo 7.2 11.9 6.7 4.7
# sodo 2.6 4.4 13.8 13.7
# sodo 11.3 16.4 12.3 9.6
# sodo 5.6 17.1 11.4 16.7
# sodo 10.4 4.7 5.7 10.6
# sodo 8.7 5.6 1.1 4.8
# sodo 19.2 14.8 7 7
# sodo 18.6 9 14.9 5
# sodo 4.3 2.4 0.3 11.1
# sodo 4.9 18.4 19.5 9.7
# sodo 18.8 3.3 15.9 12.7
# sodo 19.7 0.1 13.6 3
# sodo 11.3 11.1 6.6 9.5
# sodo 8.1 11.3 10.1 3.5
# sodo 14.1 13.5 0.5 17.2
# sodo 16.8 15.6 16.2 17.3
I've tried:
require(reshape2)
dcast(xx, formula = SOIL ~ Cutnum, value.var = xx$DM)
Which produces the following:
Error: value.var (18.943128376267911.662011714652217.372458190657214.7615862498069.991016136482364.527641483582569.107870771549641.0582387680187810.695438273251115.275471545755917.17561007011680.9180781804025171.6045100009068812.556012449786118.57340626884257.465867823921144.576288489624868.3055530954152315.88032039348041.3668353855609916.888091783039310.544018512591714.95763068087410.46029578894380.95387450419366410.41133180726329.8472668603062618.449066961184111.24195748940114.0428098617121613.89849389437598.8408243283629416.336669707670818.53340925183156.113082133233555.875797253102060.06016504485160119.295095484703784.955938421189798.97169086616486) not found in input
In addition: Warning message:
In if (!(value.var %in% names(data))) { :
the condition has length > 1 and only the first element will be used
I'd greatly appreciate a suggestion that will get a more useful result.
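Based on the comments, it seems the OP wants a wide format that keeps the original row order, with no aggregation function. Note that value.var takes the name of the column as a string ("DM"), not the column itself (xx$DM), which is what triggers the error above.
This should do it: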
set.seed(123)
xx = data.frame(SOIL = c(rep("kraz", 20), (rep("sodo", 20))),
DM = runif(0, 20,n=40),
cutnum = c(rep(1:4,10)))
require(reshape2)
xx$t <- rep(1:10, each=4) # Add column to identify subset
dcast(xx, SOIL+t~cutnum, value.var="DM")[, -2] # Remove new column
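The same helper-column idea also works without reshape2. The following is a sketch using base R's reshape() rather than the dcast call above; the helper column t is the one introduced in the answer, and wide is just an illustrative name.

```r
set.seed(123)
xx <- data.frame(SOIL = rep(c("kraz", "sodo"), each = 20),
                 DM = runif(40, min = 0, max = 20),
                 cutnum = rep(1:4, 10))

# Helper column: every block of four consecutive rows forms one output row
xx$t <- rep(1:10, each = 4)

# Spread DM across the cutnum values, one row per SOIL/t combination
wide <- reshape(xx, idvar = c("SOIL", "t"), timevar = "cutnum",
                direction = "wide")

# Drop the helper column for display
print(wide[, names(wide) != "t"])
```

reshape() names the new columns DM.1 to DM.4 instead of 1 to 4; they can be renamed afterwards if the plain cut numbers are preferred.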

How to change where horizontal axis cross in geom_bar (ggplot2) for each bar? [duplicate]

I want to make bar charts where the bar minimum can be specified (much like the box in a box and whisker plot). Can barplot do that? I suspect the answer's in ggplot, but I can't find an example.
Here's some data:
X Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 Highest recorded 31.5 31.8 30.3 28.0 24.9 24.4 21.7 20.9 24.5 25.4 26.0 28.7
2 Mean monthly maximum 27.8 28.6 27.0 24.8 22.0 20.0 18.9 18.8 20.4 22.4 23.9 26.8
3 Mean daily maximum 24.2 24.8 23.1 20.9 18.4 16.3 15.5 15.7 16.9 18.3 20.0 22.4
4 Mean 19.1 19.8 18.1 16.2 13.8 11.9 11.2 11.6 12.7 14.1 15.7 17.7
5 Mean daily minimum 14.0 14.7 13.1 11.4 9.2 7.5 6.9 7.4 8.4 10.0 11.4 13.0
6 Mean monthly minimum 7.6 9.1 6.8 3.8 2.3 -0.5 -0.2 1.0 2.3 3.7 5.3 6.7
7 Lowest recorded 4.0 5.6 4.1 -1.3 0.0 -3.1 -2.6 -1.4 -0.8 2.0 2.7 4.1
xaxis =c("J" ,"F" ,"M" ,"A" ,"M" ,"J","J","A", "S", "O","N","D")
So ideally, I end up with a stacked bar for each month, that starts at the 'Lowest recorded' value, rather than at zero.
I've also had a try with superbarplot from the UsingR package. I can get the bars to start where I want, but can't move the x axis down out of the centre of the plot. Thanks in advance.
You can use geom_boxplot in ggplot2 to get what (I think) you want by specifying the precomputed values with stat = 'identity', and use geom_crossbar to add the other values.
# first, your data
weather <- read.table(text = 'X Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 "Highest recorded" 31.5 31.8 30.3 28.0 24.9 24.4 21.7 20.9 24.5 25.4 26.0 28.7
2 "Mean monthly maximum" 27.8 28.6 27.0 24.8 22.0 20.0 18.9 18.8 20.4 22.4 23.9 26.8
3 "Mean daily maximum" 24.2 24.8 23.1 20.9 18.4 16.3 15.5 15.7 16.9 18.3 20.0 22.4
4 "Mean" 19.1 19.8 18.1 16.2 13.8 11.9 11.2 11.6 12.7 14.1 15.7 17.7
5 "Mean daily minimum" 14.0 14.7 13.1 11.4 9.2 7.5 6.9 7.4 8.4 10.0 11.4 13.0
6 "Mean monthly minimum" 7.6 9.1 6.8 3.8 2.3 -0.5 -0.2 1.0 2.3 3.7 5.3 6.7
7 "Lowest recorded" 4.0 5.6 4.1 -1.3 0.0 -3.1 -2.6 -1.4 -0.8 2.0 2.7 4.1', header =T)
library(reshape2)
library(ggplot2)
# reshape to wide format (basically transposing the data.frame)
w <- dcast(melt(weather), variable~X)
ggplot(w, aes(x = variable, ymin = `Lowest recorded`,
              ymax = `Highest recorded`, lower = `Lowest recorded`,
              upper = `Highest recorded`, middle = `Mean daily maximum`)) +
  geom_boxplot(stat = 'identity') +
  xlab('month') +
  ylab('Temperature') +
  geom_crossbar(aes(y = `Mean monthly maximum`)) +
  geom_crossbar(aes(y = `Mean monthly minimum`)) +
  geom_crossbar(aes(y = `Mean daily maximum`)) +
  geom_crossbar(aes(y = `Mean daily minimum`))
This is partially described in an example in the help for geom_boxplot
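As for whether barplot itself can do this: base R's barplot() has an offset argument that shifts each bar relative to the x axis. The following is a minimal sketch, assuming only the overall range per month (lowest to highest recorded) is needed rather than the stacked segments; the vector names highest, lowest and mids are made up for illustration.

```r
# 'Highest recorded' and 'Lowest recorded' rows from the data above
highest <- c(31.5, 31.8, 30.3, 28.0, 24.9, 24.4, 21.7, 20.9, 24.5, 25.4, 26.0, 28.7)
lowest  <- c(4.0, 5.6, 4.1, -1.3, 0.0, -3.1, -2.6, -1.4, -0.8, 2.0, 2.7, 4.1)
xaxis   <- c("J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D")

# Bar length is the range; offset moves the base of each bar to the lowest value
mids <- barplot(height = highest - lowest, offset = lowest,
                names.arg = xaxis, xlab = "month", ylab = "Temperature")

# mids holds the bar midpoints, e.g. to overlay the monthly mean as points
points(mids, c(19.1, 19.8, 18.1, 16.2, 13.8, 11.9, 11.2, 11.6, 12.7, 14.1, 15.7, 17.7),
       pch = 19)
```

This keeps everything in base graphics, at the cost of drawing the intermediate statistics by hand on top of the bars.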

Draw regression line per row in R

I have the following data.
HEIrank1
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 41.8 147.6 90.3 82.9 106.8 63.0
2 MO 20.0 20.8 21.1 20.9 12.6 20.6
3 SD 21.2 32.3 25.7 23.9 25.0 40.1
4 UN 51.8 39.8 19.9 20.9 21.6 22.5
5 WS 18.0 19.9 15.3 13.6 15.7 15.2
6 BF 11.5 36.9 20.0 23.2 18.2 23.8
7 ME 34.2 30.3 28.4 30.1 31.5 25.6
8 IM 7.7 18.1 20.5 14.6 17.2 17.1
9 OM 11.4 11.2 12.2 11.1 13.4 19.2
10 DC 14.3 28.7 20.1 17.0 22.3 16.2
11 OC 28.6 44.0 24.9 27.9 34.0 30.7
12 TH 7.4 10.0 5.8 8.8 8.7 8.6
13 CC 12.1 11.0 12.2 12.1 14.9 15.0
14 MM 11.7 24.2 18.4 18.6 31.9 31.7
15 MC 19.0 13.7 17.0 20.4 20.5 12.1
16 SH 11.4 24.8 26.1 12.7 19.9 25.9
17 SB 13.0 22.8 15.9 17.6 17.2 9.6
18 SN 11.5 18.6 22.9 12.0 20.3 11.6
19 ER 10.8 13.2 20.0 11.0 14.9 14.2
20 SL 44.9 21.6 21.3 26.5 17.0 8.0
I tried the following commands to draw a regression line for each HEI.
year <- c(2007, 2008, 2009, 2010, 2011, 2012)
# drop the HEI.ID column so that only the six yearly values remain
op <- as.numeric(HEIrank1[1, -1])
lm.r <- lm(op ~ year)
plot(year, op)
abline(lm.r)
I want to draw a regression line for each college in one graph, but I do not know how. Can you help me?
Here's my approach with ggplot2 but the graph is uninterpretable with that many lines.
library(ggplot2);library(reshape2)
mdat <- melt(HEIrank1, variable.name="year")
mdat$year <- as.numeric(substring(mdat$year, 2))
ggplot(mdat, aes(year, value, colour=HEI.ID, group=HEI.ID)) +
geom_point() + stat_smooth(se = FALSE, method="lm")
Faceting may be a better way to go:
ggplot(mdat, aes(year, value, group=HEI.ID)) +
geom_point() + stat_smooth(se = FALSE, method="lm") +
facet_wrap(~HEI.ID)
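If you would rather stay with base graphics, as in the question, the per-row fits can be done in a loop. This is a sketch under the assumption that the data frame has the layout shown in the question; only the first three rows are typed in here to keep it short, and vals and fits are illustrative names.

```r
# First three rows of HEIrank1 from the question
HEIrank1 <- data.frame(
  HEI.ID = c("OP", "MO", "SD"),
  X2007 = c(41.8, 20.0, 21.2), X2008 = c(147.6, 20.8, 32.3),
  X2009 = c(90.3, 21.1, 25.7), X2010 = c(82.9, 20.9, 23.9),
  X2011 = c(106.8, 12.6, 25.0), X2012 = c(63.0, 20.6, 40.1)
)

year <- 2007:2012
vals <- as.matrix(HEIrank1[, -1])   # numeric part only, one row per HEI

# One linear model per row
fits <- lapply(seq_len(nrow(vals)), function(i) lm(vals[i, ] ~ year))

# Points for all HEIs in one plot, then one regression line per HEI
cols <- seq_len(nrow(vals))
matplot(year, t(vals), type = "p", pch = 1, col = cols,
        xlab = "year", ylab = "value")
for (i in seq_along(fits)) abline(fits[[i]], col = cols[i])
legend("topright", legend = HEIrank1$HEI.ID, col = cols, lty = 1)
```

With all 20 rows the single plot gets crowded in the same way as the first ggplot2 version, so the faceted plot is probably still easier to read.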
