Calculating mean and interquartile range of 'cut' data to plot

Apologies, I am new to R. I have a dataset with the height and canopy density of trees, for example:
i_h100 i_cd
2.89 0.0198
2.88 0.0198
17.53 0.658
27.23 0.347
I want to regroup i_h100 into 2m intervals going from the 2m minimum to the 30m maximum. I then want to calculate the mean i_cd value and interquartile range for each of these intervals, so that I can plot them with a least squares regression. There is something wrong with the code I am using to get the mean. This is what I have so far:
mydata=read.csv("irelandish.csv")
height=mydata$i_h100
breaks=seq(2,30,by=2) #2m intervals
height.cut=cut(height, breaks, right=TRUE)
#attempt at calculating means per group
install.packages("dplyr")
mean=summarise(group_by(cut(height, breaks, right=TRUE),
mean(mydata$i_cd)))
install.packages("reshape2")
dcast(mean)
Thanks in advance for any advice.

Using aggregate() to calculate the groupwise means.
# Some example data
set.seed(1)
i_h100 <- round(runif(100, 2, 30), 2)
i_cd <- rexp(100, 1/i_h100)
mydata <- data.frame(i_cd, i_h100)
# Grouping i_h100
mydata$i_h100_2m <- cut(mydata$i_h100, seq(2, 30, by=2))
head(mydata)
# i_cd i_h100 i_h100_2m
# 1 2.918093 9.43 (8,10]
# 2 13.735728 12.42 (12,14]
# 3 13.966347 18.04 (18,20]
# 4 2.459760 27.43 (26,28]
# 5 8.477551 7.65 (6,8]
# 6 6.713224 27.15 (26,28]
# Calculate groupwise means of i_cd
i_cd_2m_mean <- aggregate(i_cd ~ i_h100_2m, mydata, mean)
# And IQR
i_cd_2m_iqr <- aggregate(i_cd ~ i_h100_2m, mydata, IQR)
upper <- i_cd_2m_mean[,2]+(i_cd_2m_iqr[,2]/2)
lower <- i_cd_2m_mean[,2]-(i_cd_2m_iqr[,2]/2)
# Plotting the result
plot.default(i_cd_2m_mean, xaxt="n", ylim=range(c(upper, lower)),
main="Groupwise means \U00B1 0.5 IQR", type="n")
points(upper, pch=2, col="lightblue", lwd=1.5)
points(lower, pch=6, col="pink", lwd=1.5)
points(i_cd_2m_mean, pch=16)
# Label the x-axis with the interval names (positions are the factor codes)
axis(1, at=as.numeric(i_cd_2m_mean[,1]), labels=as.character(i_cd_2m_mean[,1]), cex.axis=0.6, las=2)
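The question also mentions a least squares regression; one way (a sketch, assuming the bin position on the x-axis is an acceptable predictor) is to fit the groupwise means against the bin positions and add the fitted line to the plot above:
# Least squares fit of the groupwise means against the bin positions used in the plot
bin_pos <- as.numeric(i_cd_2m_mean[,1])   # factor codes, same x positions as the points
fit <- lm(i_cd_2m_mean[,2] ~ bin_pos)
abline(fit, lty=2)                        # dashed least squares line through the means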

Here is a solution:
library(reshape2)
library(dplyr)
mydata <- data_frame(i_h100=c(2.89,2.88,17.53,27.23),i_cd=c(0.0198,0.0198,0.658,0.347))
height <- mydata$i_h100
breaks <- seq(2,30,by=2) #2m intervals
height.cut <- cut(height, breaks, right=TRUE)
mydata$height.cut <- height.cut
mean_i_h100 <- mydata %>% group_by(height.cut) %>% summarise(mean_i_h100 = mean(i_h100))
A few remarks:
it is better to avoid naming variables after functions, so I changed the mean variable to mean_i_h100
I am using the pipe notation, which makes the code more readable and avoids repeating the first argument of each function; you can find a more detailed explanation here.
Without the pipe notation, the last line of code would be:
mean_i_h100 <- summarise(group_by(mydata,height.cut),mean_i_h100 = mean(i_h100))
you have to load the two packages you installed with library()
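The question also asks for the interquartile range of each interval; the same summarise() call can return both statistics at once (a sketch using the objects defined above):
stats_i_cd <- mydata %>%
  group_by(height.cut) %>%
  summarise(mean_i_cd = mean(i_cd),
            iqr_i_cd = IQR(i_cd))   # IQR() is 0 when a bin holds a single observation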

Related

Creating boxplot on log scale in R

I am trying to plot a boxplot in R, where the input file has multiple columns and each column has a different number of rows. With the help given at the following link:
boxplot of vectors with different length
I am trying:
x <- read.csv( 'filename.csv', header = T )
plot(
1, 1,
xlim=c(1,ncol(x)), ylim=range(x[-1,], na.rm=TRUE),
xaxt='n', xlab='', ylab=''
)
axis(1, labels=colnames(x), at=1:ncol(x))
for(i in 1:ncol(x)) {
p <- x[,i]
boxplot(p, add=T, at=i)
}
I am trying to plot the values on a log scale, but when I set log="y" I get the following error:
Error in xypolygon(xx, yy, lty = "blank", col = boxfill[i]) :
plot.new has not been called yet
Following is the sample of my input csv data:
A B C D
2345.42 932.19 40.8 26.19
138.48 1074.1 4405.62 4077.16
849.35 0.0 1451.66 1637.39
451.38 146.22 4579.6 5133.14
5749.01 7250.08 12.23 0.09
4125.48 129.46 49.51
440.38 6405.02
Your data as a reproducible example. Note I had to remove an extra element so the columns are the same length.
library(data.table)
df <- fread("A,B,C,D
2345.42,932.19,40.8,26.19
138.48,1074.1,4405.62,4077.16
849.35,0.0,1451.66,1637.39
451.38,146.22,4579.6,5133.14
5749.01,7250.08,12.23,0.09
4125.48,129.46,49.51,440.38", sep=",", header=T)
dplyr and tidyr solution
library(dplyr)
library(tidyr)
df1 <- df %>%
replace(.==0,NA) %>% # make 0 into NA
gather(var,values,A:D) %>% # convert from wide (4-col) to long (2-col) format
mutate(values = log10(values)) # log10 transform
If you want log2, simply replace log10 with log2.
Output
boxplot(values ~ var, df1)
A little extra
For a log10 scale, I like to add 1 to my values to eliminate negative results, since log10 of any value between 0 and 1 is negative. This sets the minimum possible value on the plot to 0, because 0 + 1 = 1 and log10(1) = 0.
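In the pipeline above that shift would look like this (a sketch; with the +1 shift the zeros can stay, since log10(0 + 1) = 0):
df2 <- df %>%
  gather(var, values, A:D) %>%            # wide to long, as before
  mutate(values = log10(values + 1))      # shift by 1 before the log10 transform
boxplot(values ~ var, df2)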

How do I estimate the coordinates of points along a graphed line in R?

Assume I have data:
x <- c(1900,1930,1944,1950,1970,1980,1983,1984)
y <- c(100,300,500,1500,2500,3500,4330,6703)
I then plot this data and add a line graph between my known x and y coordinates:
plot(x,y)
lines(x,y)
Is there a way to predict coordinates of unknown points along the graphed line?
You can use approxfun.
f <- approxfun(x, y=y)
f(seq(1900, 2000, length.out = 10))
# [1] 100.0000 174.0741 248.1481 347.6190 574.0741 1777.7778 2333.3333
# [8] 3277.7778 NA NA
Note the NAs when the sequence falls outside the range of the interpolated points (approxfun has yleft, yright and rule arguments to control this).
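For example, a quick sketch of those extrapolation controls (rule = 2 repeats the nearest endpoint value; yleft/yright set explicit values):
f_const <- approxfun(x, y, rule = 2)               # hold the endpoint values outside the range
f_const(c(1890, 2000))
# [1]  100 6703
f_na <- approxfun(x, y, yleft = 0, yright = NA)    # explicit values outside the range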
You can manually calculate the slope of each segment.
The equation of a line is given by
y − y1 = grad*(x − x1)
where grad is the change in y divided by the change in x.
We can produce the equation of each segment in the plot from its two endpoints.
f2 <- function(xnew, X=x, Y=y) {
id0 <- findInterval(xnew, X, rightmost.closed=TRUE)
id1 <- id0 + 1
grad <- (Y[id1] - Y[id0]) / (X[id1] - X[id0])
Y[id0] + grad* (xnew - X[id0])
}
f2(x)
#[1] 100 300 500 1500 2500 3500 4330 6703
f <- approxfun(x, y=y) # as before, for comparison
f(seq(1900, 2000, length.out = 10))
# [1] 100.0000 174.0741 248.1481 347.6190 574.0741 1777.7778 2333.3333 3277.7778 NA NA
f2(seq(1900, 2000, length.out = 10))
# [1] 100.0000 174.0741 248.1481 347.6190 574.0741 1777.7778 2333.3333 3277.7778 NA NA
If you wanted to extrapolate using the final slope, you can do this by adding
the all.inside=TRUE argument to findInterval.
With that said, approxfun does it better and more easily!
Or, if you want to approximate points along the line at regular intervals, you can use approx instead of approxfun. For that purpose, it may save you a little coding.
x <- c(1900,1930,1944,1950,1970,1980,1983,1984)
y <- c(100,300,500,1500,2500,3500,4330,6703)
new_points <- approx(x, y)
lapply(new_points, head)
#> $x
#> [1] 1900.000 1901.714 1903.429 1905.143 1906.857 1908.571
#>
#> $y
#> [1] 100.0000 111.4286 122.8571 134.2857 145.7143 157.1429
plot(new_points)
lines(x,y)
Created on 2022-11-20 with reprex v2.0.2
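The example above uses approx's default regular grid of 50 points; if you want the y values at specific x positions instead, pass them via xout (a short sketch):
approx(x, y, xout = c(1925, 1947, 1960))
#> $x
#> [1] 1925 1947 1960
#>
#> $y
#> [1]  266.6667 1000.0000 2000.0000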

Get specific elements from clustered data in R

I generated this image (a heatmap with a row dendrogram) using the hclust function. Now I want the IDs of the elements highlighted by squares.
Is there any way to get the IDs and related values from the clustered dataset? Thanks
EDIT
I used this R script
library(gplots)
library(geneplotter)
# read the data in from URL
bots <- read.table("expression.txt")
# get just the alpha data
abot <- bots[,c(1:9)]
rownames(abot) <- bots[,1]
abot[1:7,]
# get rid of NAs
abot[is.na(abot)] <- 0
# we need to find a way of reducing the data. Can't do ANOVA as there are no
# replicates. Sort on max difference and take first 1000
min <-apply(abot, 1, min)
max <- apply(abot, 1, max)
sabot <- abot[order(max - min, decreasing=TRUE),][1:1000,]
# cluster on correlation
cdist <- as.dist(1 - cor(t(sabot)))
hc <- hclust(cdist, "average")
# draw a heatmap
x11()
heatmap.2(as.matrix(sabot),
Rowv=as.dendrogram(hc),
Colv=FALSE,
cexRow=1,
cexCol=1,
dendrogram="row",
scale="row",
trace="none",
density.info="none",
key=FALSE,
col=greenred.colors(80))
and my data look like this
YF MF SF YL ML SL Stem Root SULE
1 31.64075611 32.2728151 38.81790359 252.8901009 269.7599455 138.5011042 16.58308894 10.47935935 3.364295997
2 6.484902171 9.141084197 5.748798541 3.637332586 4.762966989 4.149302282 7.194971046 9.932508868 1.600027931
3 14.15218386 8.784155316 9.740794214 6.566584262 6.130503033 7.747728536 12.57014531 15.75181203 9.22907038
4 15.72881736 19.95755802 10.13050089 10.31313758 9.838844457 14.24864327 13.00442008 23.85404067 12.17251862
5 30.45475953 15.57131432 17.15277867 8.884751572 8.78786964 12.4745649 11.90176123 35.9844343 6.904763942
6 15.87149807 19.05523246 13.12846166 12.99750491 15.3775883 19.0044086 21.66051467 20.38501538 39.58478032
7 16.58935728 18.63990933 17.20955634 13.04423927 29.98424087 18.02165996 22.22403582 32.38377369 10.90832984
8 29.91118855 19.65844846 23.45958109 62.56338088 55.3926187 39.85296152 31.4832543 14.8484163 1.326553777
9 4.09192129 15.52499475 12.14321788 1.680854758 3.448485979 5.245481483 15.14443161 28.85873063 1.073855381
10 7.02768911 4.267210165 3.383501945 3.53716686 3.105614581 3.493791292 3.806360251 6.713067543 3.338740245
11 17.61821596 18.03607855 12.939663 8.951935241 15.45268577 15.53817186 20.5098186 23.42760284 27.97680418
12 66.35291651 40.41837702 37.7239447 32.42998176 30.09696289 27.81089554 33.27197681 46.5393928 4.141505618
13 15.45804403 15.98469202 17.21176468 9.105208867 11.76140929 13.9751105 14.72159466 25.68388472 7.493988128
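One common way to recover the row IDs behind a branch of the row dendrogram is cutree() on the hc object built above; a minimal sketch (the number of clusters k is an assumption to adjust until it matches the highlighted squares):
# cut the row dendrogram into k groups (k = 4 is only an example)
groups <- cutree(hc, k = 4)
# row IDs and their expression values for, say, group 2
ids <- names(groups)[groups == 2]
sabot[ids, ]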

Partial Autocorrelation without using pacf() in r

If I think I understand something, I like to verify it, so in this case I was trying to verify the calculation of the partial autocorrelation, pacf().
What I end up with is something a little different. My understanding is that the PACF at a given lag is the coefficient on the last/furthest lag in a regression on all of the previous lags. To set up some code, I'm using the Canadian employment data from Elements of Forecasting by F. Diebold (1998), Chapter 6.
#Obtain Canadian Employment dataset
caemp <- c(83.090255, 82.7996338824, 84.6344380294, 85.3774583529, 86.197605, 86.5788438824, 88.0497240294, 87.9249263529, 88.465131, 88.3984638824, 89.4494320294, 90.5563753529, 92.272335, 92.1496788824, 93.9564890294, 94.8114863529, 96.583434, 96.9646728824, 98.9954360294, 101.138164353, 102.882122, 103.095394882, 104.006386029, 104.777404353, 104.701732, 102.563504882, 103.558486029, 102.985774353, 102.098281, 101.471734882, 102.550696029, 104.021564353, 105.093652, 105.194954882, 104.594266029, 105.813184353, 105.149642, 102.899434882, 102.354736029, 102.033974353, 102.014299, 101.835654882, 102.018806029, 102.733834353, 103.134062, 103.263354882, 103.866416029, 105.393274353, 107.081242, 108.414274882, 109.297286029, 111.495994353, 112.680072, 113.061304882, 112.376636029, 111.244054353, 107.305192, 106.678644882, 104.678246029, 105.729204353, 107.837082, 108.022364882, 107.281706029, 107.016934353, 106.045452, 106.370704882, 106.049966029, 105.841184353, 106.045452, 106.650644882, 107.393676029, 108.668584353, 109.628702, 110.261894882, 110.920946029, 110.740154353, 110.048622, 108.190324882, 107.057746029, 108.024724353, 109.712692, 111.409654882, 108.765396029, 106.289084353, 103.917902, 100.799874882, 97.3997700294, 93.2438143529, 94.123068, 96.1970798824, 97.2754290294, 96.4561423529, 92.674237, 92.8536228824, 93.4304540294, 93.2055593529, 93.955896, 94.7296738824, 95.5665510294, 95.5459793529, 97.09503, 97.7573598824, 96.1609430294, 96.5861653529, 103.874812, 105.094384882, 106.804276029, 107.786744353, 106.596022, 107.310354882, 106.897156029, 107.210924353, 107.134682, 108.829774882, 107.926196029, 106.298904353, 103.365872, 102.029554882, 99.3000760294, 95.3045073529, 90.50099, 88.0984848824, 86.5150710294, 85.1143943529, 89.033584, 88.8229008824, 88.2666710294, 87.7260053529, 88.102896, 87.6546968824, 88.4004090294, 88.3618013529, 89.031151, 91.0202948824, 91.6732820294, 92.0149173529)
# create time series with the canadian employment dataset
caemp.ts<-ts(caemp, start=c(1961, 1), end=c(1994, 4), frequency=4)
caemp.ts2<-window(caemp.ts,start=c(1961,5), end=c(1993,4))
# set up max lag; the book says use sqrt(T), but I'm using 3 for this example
lag.max <- 3
# R Code using pacf()
pacf(caemp.ts2, lag.max=3, plot=F)
# initialize vector to capture the partial autocorrelations
pauto.corr <- rep(0, lag.max)
# Set up lagged data frame
pa.mat <- as.data.frame(caemp.ts2)
for(i in 1:lag.max){
a <- c(rep(NA, i), pa.mat[1:(length(caemp.ts2) - i),1])
pa.mat <- cbind(pa.mat, a)
}
names(pa.mat) <- c("0":lag.max)
# Set up my base linear model
base.lm <- lm(pa.mat[, 1] ~ 1)
### I could not get the for loop to work successfully here
i <- 1
base.lm <- update(base.lm, .~. + pa.mat[,2])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-2
base.lm <-update(base.lm, .~. + pa.mat[,3])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-3
base.lm <-update(base.lm, .~. + pa.mat[,4])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
# Compare results...
round(pauto.corr,3)
pacf(caemp.ts2, lag.max=3, plot=F)
For the output
> round(pauto.corr,3)
[1] 0.971 -0.479 -0.072
> pacf(caemp.ts2, lag.max=3, plot=F)
Partial autocorrelations of series ‘caemp.ts2’, by lag
0.25 0.50 0.75
0.949 -0.244 -0.100
Maybe it is because my example is quarterly and not monthly data, or I could just be wrong?
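For what it's worth, a compact way to run those regressions is embed(); this sketch takes the coefficient on the deepest lag from an OLS fit at each lag order. It can still differ from pacf(), which works through the sample autocorrelations (a Yule-Walker style calculation) rather than OLS, and the two can diverge noticeably for a very persistent series like this one.
# regression-based PACF: column 1 of embed() is x_t, columns 2.. are lags 1..lag.max
X <- embed(as.numeric(caemp.ts2), lag.max + 1)
pacf_ols <- sapply(1:lag.max, function(k) {
  fit <- lm(X[, 1] ~ X[, 2:(k + 1)])
  unname(tail(coef(fit), 1))     # coefficient on the deepest lag
})
round(pacf_ols, 3)
pacf(caemp.ts2, lag.max = lag.max, plot = FALSE)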

divide a range of values in bins of equal length: cut vs cut2

I'm using the cut function to split my data into equal bins. It does the job, but I'm not happy with the way it returns the values; what I need is the center of each bin, not the upper and lower ends.
I've also tried cut2 {Hmisc}. This gives me the center of each bin, but it divides the range of the data into bins that contain the same number of observations, rather than bins of the same length.
Does anyone have a solution to this?
It's not too hard to make the breaks and labels yourself, with something like this. Here, since the midpoint is a single number, I don't actually return a factor with labels but a numeric vector instead.
cut2 <- function(x, breaks) {
r <- range(x)
b <- seq(r[1], r[2], length=2*breaks+1)
brk <- b[0:breaks*2+1]
mid <- b[1:breaks*2]
brk[1] <- brk[1]-0.01
k <- cut(x, breaks=brk, labels=FALSE)
mid[k]
}
There's probably a better way to get the bin breaks and midpoints; I didn't think about it very hard.
Note that this answer is different from Joshua's: his gives the median of the data in each bin, while this gives the center of each bin.
> head(cut2(x,3))
[1] 16.666667 3.333333 16.666667 3.333333 16.666667 16.666667
> head(ave(x, cut(x,3), FUN=median))
[1] 18 2 18 2 18 18
Use ave like so:
set.seed(21)
x <- sample(0:20, 100, replace=TRUE)
xCenter <- ave(x, cut(x,3), FUN=median)
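Another option, just a sketch using the same sample data: parse the interval labels that cut() already produces and map each observation to its bin midpoint (note that cut() pads the range slightly and rounds the printed labels, so these midpoints are only approximate):
k <- cut(x, 3)
# parse the "(lo,hi]" labels into numeric endpoints, then take each bin's midpoint
lo <- as.numeric(sub("\\((.+),.+", "\\1", levels(k)))
hi <- as.numeric(sub(".+,(.+)\\]", "\\1", levels(k)))
mid <- (lo + hi) / 2
head(mid[as.integer(k)])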
We can use smart_cut from package cutr:
devtools::install_github("moodymudskipper/cutr")
library(cutr)
Using #Joshua's sample data:
median by interval (same output as #Joshua, except it's an ordered factor):
smart_cut(x,3, "n_intervals", labels= ~ median(.))
# [1] 18 2 18 2 18 18 ...
# Levels: 2 < 11 < 18
center of each interval (same output as #Aaron, except it's an ordered factor):
smart_cut(x,3, "n_intervals", labels= ~ mean(.y))
# [1] 16.67 3.333 16.67 3.333 16.67 16.67 ...
# Levels: 3.333 < 10 < 16.67
mean of values by interval:
smart_cut(x,3, "n_intervals", labels= ~ mean(.))
# [1] 17.48 2.571 17.48 2.571 17.48 17.48 ...
# Levels: 2.571 < 11.06 < 17.48
labels can be a character vector, just like in base::cut.default, but it can also be, as here, a function of two parameters: the first is the values contained in the bin, and the second is the cut points of the bin.
more on cutr and smart_cut
