I am attempting to use the following dataset to reproduce a histogram in R:
ages <- c(26, 31, 35, 37, 43, 43, 43, 44, 45, 47, 48, 48, 49, 50, 51, 51, 51, 51, 52, 54, 54, 54, 54,
55, 55, 55, 56, 57, 57, 57, 58, 58, 58, 58, 59, 59, 62, 62, 63, 64, 65, 65, 65, 66, 66, 67,
67, 72, 86)
I would like to get a histogram that looks as close as possible to this:
However, I am having three problems:
I am unable to get my frequency count on the y-axis to reach 18
I haven't been able to get the squiggly break symbol on the x-axis
My breaks don't seem to be properly setting to the vector I entered in my code
I read over ?hist and thought the first two issues could be accomplished by setting xlim and ylim, but that doesn't seem to be working.
I'm at a loss for the third issue since I thought it could be accomplished by including breaks = c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5).
Here's my code so far:
hist(ages, breaks = c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5),
freq=TRUE, col = "lightblue", xlim = c(25.5, 88.5), ylim = c(0,18),
xlab = "Age", ylab = "Frequency")
Followed by my corresponding histogram:
Any bump in the right direction is appreciated.
1. Reaching 18.
It appears that your data contain at most 17 values in the bin between 52.5 and 61.5, and that is even when counting with a closed interval on both sides:
ages <- c(26, 31, 35, 37, 43, 43, 43, 44, 45, 47, 48, 48, 49, 50, 51, 51, 51,
51, 52, 54, 54, 54, 54, 55, 55, 55, 56, 57, 57, 57, 58, 58, 58, 58,
59, 59, 62, 62, 63, 64, 65, 65, 65, 66, 66, 67, 67, 72, 86
)
sum(ages >= 52.5 & ages <= 61.5)
[1] 17
So your histogram only reflects that.
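To see the count in every bin at once, a minimal sketch (assuming the same break points as in the question) is to ask hist() for its counts without drawing anything. The name brks is only used for this illustration.

```r
ages <- c(26, 31, 35, 37, 43, 43, 43, 44, 45, 47, 48, 48, 49, 50, 51, 51, 51,
          51, 52, 54, 54, 54, 54, 55, 55, 55, 56, 57, 57, 57, 58, 58, 58, 58,
          59, 59, 62, 62, 63, 64, 65, 65, 65, 66, 66, 67, 67, 72, 86)
brks <- c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(ages, breaks = brks, plot = FALSE)
h$counts       # 2 5 12 17 11 1 1
max(h$counts)  # 17
```

The tallest bar is therefore 17, so no choice of ylim can make it reach 18 with this data.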
2. Break symbol.
For that you might be interested in THIS SO ANSWER
3. Breaks.
If you read help(hist) you will see that breaks specify the points at which the groups are formed:
... * a vector giving the breakpoints between histogram cells
So your breaks work as intended. The problem you have is with showing the same numbers on the x-axis. Here ANOTHER SO ANSWER might help you.
Example
Here is how you could go about reproducing the plot.
library(plotrix) # for the break on x axis
library(shape) # for styled arrow heads
# manually select axis ticks and colors
xticks <- c(25.5, 34.5, 43.5, 52.5, 61.5, 70.5, 79.5, 88.5)
yticks <- seq(2, 18, 2)
bgcolor <- "#F2ECE4" # color for the background
barcolor <- "#95CEEF" # color for the histogram bars
# top level parameters - background color and font type
par(bg=bgcolor, family="serif")
# establish a new plotting window with a coordinate system
plot.new()
plot.window(xlim=c(23, 90), ylim=c(0, 20), yaxs="i")
# add horizontal background lines
abline(h=yticks, col="darkgrey")
# add a histogram using our selected break points
hist(ages, breaks=xticks, freq=TRUE, col=barcolor, xaxt='n', yaxt='n', add=TRUE)
# L-shaped bounding box for the plot
box(bty="L")
# add x and y axis
axis(side=1, at=xticks)
axis(side=2, at=yticks, labels=NA, las=1, tcl=0.5) # for inward ticks
axis(side=2, at=yticks, las=1)
axis.break(1, 23, style="zigzag", bgcol=bgcolor, brw=0.05, pos=0)
# add labels
mtext("Age", 1, line=2.5, cex=1.2)
mtext("Frequency", 2, line=2.5, cex=1.2)
# add arrows
u <- par("usr")
Arrows(88, 0, u[2], 0, code = 2, xpd = TRUE, arr.length=0.25)
Arrows(u[1], 18, u[1], u[4], code = 2, xpd = TRUE, arr.length=0.25)
And the picture:
I am using an xgboost model to predict onto a raster stack. I have successfully used the same approach with CART, xgb and Random Forest models:
library(raster)
library(xgboost)
# create a RasterStack or RasterBrick with a set of predictor layers
logo <- brick(system.file("external/rlogo.grd", package="raster"))
names(logo)
# known presence and absence points
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
# extract values for points
xy <- rbind(cbind(1, p), cbind(0, a))
v <- data.frame(cbind(pa=xy[,1], extract(logo, xy[,2:3])))
xgb <- xgboost(data = data.matrix(subset(v, select = -c(pa))), label = v$pa,
nrounds = 5)
raster::predict(model = xgb, logo)
But with xgboost I get the following error:
Error in xgb.DMatrix(newdata, missing = missing) :
xgb.DMatrix does not support construction from list
The problem is that predict.xgb.Booster does not accept a data.frame for argument newdata (see ?predict.xgb.Booster). That is unexpected (all common predict.* methods take a data.frame), but we can work around it. I show how to do that below, using the "terra" package instead of the obsolete "raster" package (but the solution is exactly the same for either package).
The example data
library(terra)
library(xgboost)
logo <- rast(system.file("ex/logo.tif", package="terra"))
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
xy <- rbind(cbind(1, p), cbind(0, a))
v <- extract(logo, xy[,2:3])
xgb <- xgboost(data = data.matrix(v), label=xy[,1], nrounds = 5)
The workaround is to write a prediction function that first coerces the data.frame with "new data" to a matrix. We can use that function with predict<SpatRaster>:
xgbpred <- function(model, data, ...) {
predict(model, newdata=as.matrix(data), ...)
}
p <- predict(logo, model=xgb, fun=xgbpred)
plot(p)
I have a list of 10 plots/graphs from model_list for which I used the following code below. I stored these plots in the list var_list.
library(mixOmics)
var_list<-lapply(model_list, function(x) plotVar(x))
var_list thus contains 10 plots. For example, here is the first element of the list:
> var_list[[1]]
x y Block names pch cex col font Overlap
TPI200 -0.6975577 -0.5582925 X TPI200 1 5 #388ECC 1 Correlation Circle Plots
TPI350 -0.8561514 -0.4101970 X TPI350 1 5 #388ECC 1 Correlation Circle Plots
TPI500 -0.9403552 -0.1074518 X TPI500 1 5 #388ECC 1 Correlation Circle Plots
TPI700 -0.9256605 0.3070954 X TPI700 1 5 #388ECC 1 Correlation Circle Plots
TPI900 -0.8697037 0.4699423 X TPI900 1 5 #388ECC 1 Correlation Circle Plots
I want to save these plots from this list as a jpeg (resulting in 10 different jpeg's). I used the following code and R creates 10 images, but all the images are the same (so only the first plot is created and duplicated for the rest).
lapply(1:length(model_list), function (x) {
jpeg(paste0(names(model_list)[x], ".jpg"))
lapply(model_list, function(x) plotVar(x))
dev.off()
})
I have seen similar questions, but I can't find the right solution to have a jpg for each plot for each dataframe in the list! How can I solve this? Many thanks in advance!
Via this link you can find the dput(model_list[[1]]).
Using the data you provided in a similar post, here is a possible solution to your issue. It is better to work from model_list, because once you transform it to var_list all the data become graphical elements. The code below builds a replica of model_list from datalist; in your real problem you already have model_list, and each component of the list must have a name:
library(mixOmics)
#Data
datalist <- list(df1 = structure(list(OID = c(-1, -1, -1, -1, -1, -1), POINTID = c(1,
2, 3, 4, 5, 6), WETLAND = c("no wetl", "no wetl", "no wetl",
"wetl", "wetl", "wetl"), TPI200 = c(70, 37, 45, 46, 58, 56),
TPI350 = c(67, 42, 55, 58, 55, 53), TPI500 = c(55, 35, 45,
51, 53, 51), TPI700 = c(50, 29, 39, 43, 49, 49), TPI900 = c(48,
32, 41, 46, 47, 46), TPI1000 = c(46, 16, 41, 36, 46, 46),
TPI2000 = c(53, 17, 53, 54, 54, 54), TPI3000 = c(47, 35,
47, 47, 47, 47), TPI4000 = c(49, 49, 49, 49, 49, 49), TPI5000 = c(63,
63, 63, 62, 62, 61), TPI2500 = c(48, 26, 48, 49, 49, 49)), row.names = c(NA,
6L), class = "data.frame"), df2 = structure(list(OID = c(-1,
-1, -1, -1, -1, -1), POINTID = c(1, 2, 3, 4, 5, 6), WETLAND = c("no wetl",
"no wetl", "no wetl", "wetl", "wetl", "wetl"), TPI200 = c(70,
37, 45, 46, 58, 56), TPI350 = c(67, 42, 55, 58, 55, 53), TPI500 = c(55,
35, 45, 51, 53, 51), TPI700 = c(50, 29, 39, 43, 49, 49), TPI900 = c(48,
32, 41, 46, 47, 46), TPI1000 = c(46, 16, 41, 36, 46, 46), TPI2000 = c(53,
17, 53, 54, 54, 54), TPI3000 = c(47, 35, 47, 47, 47, 47), TPI4000 = c(49,
49, 49, 49, 49, 49), TPI5000 = c(63, 63, 63, 62, 62, 61), TPI2500 = c(48,
26, 48, 49, 49, 49)), row.names = c(NA, 6L), class = "data.frame"))
#Function
custom_splsda <- function(datalist, ncomp, keepX, ..., Xcols, Ycol){
Y <- datalist[[Ycol]]
X <- datalist[Xcols]
res <- splsda(X, Y, ncomp = ncomp, keepX = keepX, ...)
res
}
#Create model_list, you must have the object
model_list <- lapply(datalist, custom_splsda,
ncomp = 2, keepX = c(5, 5),
Xcols = 4:8, Ycol = "WETLAND")
Next the loop for plots:
#Loop
for(i in 1:length(model_list))
{
jpeg(paste0(names(model_list)[i], ".jpg"))
plotVar(model_list[[i]],title = names(model_list)[i])
dev.off()
}
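The pattern in the loop above (open one device, draw exactly one plot, close the device) can also be wrapped in a small helper. The following is only a sketch under stated assumptions: save_one, objs and files are hypothetical names, plain numeric vectors stand in for the fitted models, and the demo writes PDF files into tempdir() because pdf() needs no optional graphics capabilities. Leaving device at its default and using ".jpg" file names should give JPEGs as in the loop.

```r
# Hypothetical helper: open one device, draw one plot, close the device.
save_one <- function(obj, file, plot_fun, device = grDevices::jpeg) {
  device(file)
  on.exit(grDevices::dev.off())  # close the device even if plot_fun() fails
  plot_fun(obj)
  invisible(file)
}

# Stand-in data: two named numeric vectors instead of fitted models
objs <- list(df1 = c(70, 37, 45, 46, 58, 56),
             df2 = c(67, 42, 55, 58, 55, 53))
files <- file.path(tempdir(), paste0(names(objs), ".pdf"))

# One file per list element; the index i selects both the object and its name
for (i in seq_along(objs)) {
  save_one(objs[[i]], files[i],
           plot_fun = function(x) plot(x, main = names(objs)[i]),
           device = grDevices::pdf)
}
file.exists(files)
```

The attempt in the question produced identical images because the inner lapply() drew every model onto each device, rather than only the model that matches the current file name.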
That will produce plots in your folder as you can see here:
And also the plots that change (see titles):
I managed to get the plot that I want by manually typing in the commands, but I failed to change the x-axis from numerical scale to categorical variables which are months (Jan-Dec). Please assist with commands I can use to replicate the plot below from imported data, csv format?
mydata <- c(1,2,3,4,5,6,7,8,9,10,11,12)
Y1 <- c(20, 18, 40, 54, 63, 63, 59, 72, 64, 60, 46, 44 )
Y2 <- c(16, 32, 89, 87, 64, 64, 39, 36, 29, 55, 58, 46 )
Y3 <- c(46, 39, 43, 67, 45, 37, 13, 11, 38, 59, 53, 47 )
Y4 <- c(32, 41, 45, 43, 62, 69, 94, 66, 62, 58, 87, 34 )
Y5 <- c(51, 31, 91, 99, 37, 62, 12, 30, 65, 59, 72, 44 )
plot(mydata, Y1, type="l", col="#FF0000", pch="*", lty=1, ylim=c(0,100), xlab = "Months", ylab="Y %" )
points(mydata, Y2, col="#FFFF00", pch="*")
lines(mydata, Y2, col="#FFFF00",lty=1)
points(mydata, Y3, col="#00FF00",pch="*")
lines(mydata, Y3, col="#00FF00",lty=1)
points(mydata, Y4, col="#0000FF", pch="*")
lines(mydata, Y4, col="#0000FF",lty=1)
points(mydata, Y5, col="#00FFFF", pch="*")
lines(mydata, Y5, col="#00FFFF",lty=1)
legend("topleft", c("2011","2012","2013","2014","2015"), col=c("#FF0000","#FFFF00","#00FF00","#0000FF","#00FFFF"),
pch=c("*","*","*","*","*"),lty=c(1,1,1,1,1), ncol=1, cex = 0.5)
I want my x-axis to range from Jan to Dec instead of 1 to 12
You can try this ggplot2 approach (Updated):
library(reshape2)
library(ggplot2)
mydata <- factor(format(seq(as.Date('2020-01-01'), length.out=12, by='1 month'),'%b'),
levels = format(seq(as.Date('2020-01-01'), length.out=12, by='1 month'),'%b'),
ordered = TRUE)
Y1 <- c(20, 18, 40, 54, 63, 63, 59, 72, 64, 60, 46, 44 )
Y2 <- c(16, 32, 89, 87, 64, 64, 39, 36, 29, 55, 58, 46 )
Y3 <- c(46, 39, 43, 67, 45, 37, 13, 11, 38, 59, 53, 47 )
Y4 <- c(32, 41, 45, 43, 62, 69, 94, 66, 62, 58, 87, 34 )
Y5 <- c(51, 31, 91, 99, 37, 62, 12, 30, 65, 59, 72, 44 )
#Create data
df <- data.frame(mydata,Y1,Y2,Y3,Y4,Y5)
#Melt
df.melted <- melt(df,id.vars = 'mydata')
#plot
ggplot(df.melted,aes(x=factor(mydata),y=value,color=variable,group=variable))+
geom_point()+
geom_line()+
scale_color_manual(labels=c(2011:2015),values = c('red','blue','green','pink','orange'))+
xlab('Month')
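If you would rather stay in base graphics, a minimal sketch is to suppress the default x-axis with xaxt = "n" and then draw your own axis with the built-in month.abb labels. The names months, Y and cols are introduced here only for illustration, and labelling the series Y1 to Y5 as 2011 to 2015 is an assumption taken from the ggplot2 code above.

```r
months <- 1:12
Y <- cbind(
  "2011" = c(20, 18, 40, 54, 63, 63, 59, 72, 64, 60, 46, 44),
  "2012" = c(16, 32, 89, 87, 64, 64, 39, 36, 29, 55, 58, 46),
  "2013" = c(46, 39, 43, 67, 45, 37, 13, 11, 38, 59, 53, 47),
  "2014" = c(32, 41, 45, 43, 62, 69, 94, 66, 62, 58, 87, 34),
  "2015" = c(51, 31, 91, 99, 37, 62, 12, 30, 65, 59, 72, 44)
)
cols <- c("red", "blue", "green", "pink", "orange")

# Draw all series at once, without the default numeric x-axis
matplot(months, Y, type = "b", pch = "*", lty = 1, col = cols,
        ylim = c(0, 100), xaxt = "n", xlab = "Months", ylab = "Y %")

# Add the x-axis by hand: ticks at 1..12, labelled Jan..Dec
axis(side = 1, at = months, labels = month.abb)
legend("topleft", legend = colnames(Y), col = cols, pch = "*", lty = 1, cex = 0.5)
```

The same two steps (xaxt = "n", then axis()) also work with the plot()/lines()/points() calls from the question.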
I'm trying to figure out how to calculate the area of a river cross section.
For the cross section I have the depth at every 25 cm over the 5 m wide river.
x_profile <- seq(0, 500, 25)
y_profile = c(50, 73, 64, 59, 60, 64, 82, 78, 79, 76, 72, 68, 63, 65, 62, 61, 56, 50, 44, 39, 25)
If anyone has suggestions for how this could be done in R, they would be highly appreciated.
We can use the sf package to create a polygon showing the cross-section and then calculate the area. Notice that to create a polygon, it is necessary to provide three more points as c(0, 0), c(500, 0), and c(0, 0) when creating the matrix m.
x_profile <- seq(0, 500, 25)
y_profile <- c(50, 73, 64, 59, 60, 64, 82, 78, 79, 76, 72,
68, 63, 65, 62, 61, 56, 50, 44, 39, 25)
library(sf)
# Create matrix with coordinates
m <- matrix(c(0, x_profile, 500, 0, 0, -y_profile, 0, 0),
byrow = FALSE, ncol = 2)
# Create a polygon
poly <- st_polygon(list(m))
# View the polygon
plot(poly)
# Calculate the area
st_area(poly)
[1] 31312.5
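As a cross-check that needs no extra package, the same area can be obtained with the trapezoidal rule in base R. This is a sketch that assumes the water surface is flat at depth 0 and that positions and depths are both in cm, so the result is in square cm. trapz_area is a name made up for this illustration.

```r
x_profile <- seq(0, 500, 25)
y_profile <- c(50, 73, 64, 59, 60, 64, 82, 78, 79, 76, 72,
               68, 63, 65, 62, 61, 56, 50, 44, 39, 25)

# Hypothetical helper: trapezoidal rule over consecutive (x, y) pairs.
# Each strip has width diff(x) and the mean of its two bounding depths as height.
trapz_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

area <- trapz_area(x_profile, y_profile)
area  # 31312.5
```

This agrees with st_area() because the polygon is bounded by straight segments between the measured depths, which is exactly what the trapezoidal rule integrates.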
I can't find a description of what the end points of the lines of a boxplot represent.
For example, here are point values above and below where the lines end.
(I realize that the top and bottom of the box are 25th and 75th percentile, and the centerline is the 50th). I assume, as there are points above and below the lines that they do not represent the max/min values.
The "dots" at the end of the boxplot represent outliers. There are a number of different rules for determining if a point is an outlier, but the method that R and ggplot use is the "1.5 rule". If a data point is:
less than Q1 - 1.5*IQR
greater than Q3 + 1.5*IQR
then that point is classed as an "outlier". The whiskers are defined as:
upper whisker = min(max(x), Q_3 + 1.5 * IQR)
lower whisker = max(min(x), Q_1 – 1.5 * IQR)
where IQR = Q_3 – Q_1, the box length. So the upper whisker is located at the smaller of the maximum x value and Q_3 + 1.5 IQR,
whereas the lower whisker is located at the larger of the smallest x value and Q_1 – 1.5 IQR.
Additional information
See the wikipedia boxplot page for alternative outlier rules.
There are actually a variety of ways of calculating quantiles. Have a look at ?quantile for the description of the nine different methods.
Example
Consider the following example
> set.seed(1)
> x = rlnorm(20, 1/2)#skewed data
> par(mfrow=c(1,3))
> boxplot(x, range=1.7, main="range=1.7")
> boxplot(x, range=1.5, main="range=1.5")#default
> boxplot(x, range=0, main="range=0")#The same as range="Very big number"
This gives the following plot:
As we decrease range from 1.7 to 1.5 we reduce the length of the whisker. However, range=0 is a special case - it's equivalent to "range=infinity"
I think ggplot uses the standard defaults, the same as boxplot: "the whiskers extend to the most extreme data point which is no more than [1.5] times the length of the box away from the box"
See: boxplot.stats
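A short sketch with boxplot.stats() illustrates this: the whisker ends are actual data values inside the fences, not the fences themselves. It reuses the skewed sample from the example above. The names hinges, fences and whiskers are introduced only for illustration, and note that boxplot.stats() uses hinges (from fivenum()), which can differ slightly from the quartiles returned by quantile().

```r
set.seed(1)
x <- rlnorm(20, 1/2)  # skewed data, as in the example above

bs <- boxplot.stats(x)        # coef = 1.5 by default
hinges <- bs$stats[c(2, 4)]   # lower and upper hinge (the box edges)
fences <- hinges + c(-1.5, 1.5) * diff(hinges)

whiskers <- bs$stats[c(1, 5)]
whiskers %in% x               # whisker ends are observed data values
bs$out                        # points beyond the fences, drawn as dots

# coef = 0 makes the whiskers run to the data extremes
boxplot.stats(x, coef = 0)$stats[c(1, 5)]
range(x)
```

The coef argument of boxplot.stats() plays the same role as the range argument of boxplot().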
P1IMSA Tutorial 8 - Understanding Box and Whisker Plots video offers a visual step-by-step explanation of (Tukey) box and whisker plots.
At 4m 23s I explain the meaning of the whisker ends and its relationship to the 1.5*IQR.
Although the chart shown in the video was rendered using D3.js rather than R, its explanations jibe with the R implementations of boxplots mentioned.
As highlighted by @TemplateRex in a comment, ggplot doesn't draw the whiskers at the upper/lower quartile plus/minus 1.5 times the IQR. It actually draws them at max(x[x < Q3 + 1.5 * IQR]) and min(x[x > Q1 - 1.5 * IQR]). For example, here is a plot drawn using geom_boxplot where I've added a dashed line at the value Q1 - 1.5*IQR:
Q1 = 52
Q3 = 65
Q1 - 1.5 * IQR = 52 - 13*1.5 = 32.5 (dashed line)
Lower whisker = min(x[x > Q1 - 1.5 * IQR]) = 35 (where x is the data used to create the boxplot, outlier is at x = 27).
MWE
Note this isn't the exact code I used to produce the image above, but it gets the point across.
library("mosaic") # For favstats()
library("ggplot2") # For ggplot() and geom_boxplot()
df <- c(54, 41, 55, 66, 71, 50, 65, 54, 72, 46, 36, 64, 49, 64, 73,
52, 53, 66, 49, 64, 44, 56, 49, 54, 61, 55, 52, 64, 60, 54, 59,
67, 58, 51, 63, 55, 67, 68, 54, 53, 58, 26, 53, 56, 61, 51, 51,
50, 51, 68, 60, 67, 66, 51, 60, 52, 79, 62, 55, 74, 62, 59, 35,
67, 58, 74, 48, 53, 40, 62, 67, 57, 68, 56, 75, 55, 41, 50, 73,
57, 62, 61, 48, 60, 64, 53, 53, 66, 58, 51, 68, 69, 69, 58, 54,
57, 65, 78, 70, 52, 59, 52, 65, 70, 53, 57, 72, 47, 50, 70, 41,
64, 59, 58, 65, 57, 60, 70, 46, 40, 76, 60, 64, 51, 38, 67, 57,
64, 51)
df <- as.data.frame(df)
Q1 <- favstats(df$df)$Q1 # pass the numeric column rather than the whole data frame
Q3 <- favstats(df$df)$Q3
IQR <- Q3 - Q1
lowerlim <- Q1 - 1.5*IQR
upperlim <- Q3 + 1.5* IQR
boxplot_Tukey_lower <- min(df[df > lowerlim])
boxplot_Tukey_upper <- max(df[df < upperlim])
ggplot(df, aes(x = "", y = df)) +
stat_boxplot(geom ='errorbar', width = 0.5) +
geom_boxplot() +
geom_hline(yintercept = lowerlim, linetype = "dashed") +
geom_hline(yintercept = upperlim, linetype = "dashed")