Identifying points in a histogram - R

In R, I'm trying to interactively identify the bin value in a histogram using the mouse. I think I need something equivalent to the identify() function for scatterplots, but identify() doesn't seem to work for histograms.

Use locator() to find the points, then look up which interval each x value sits in, check that the y value is less than the height of the bar, and return the count:
set.seed(100)
h <- hist(rnorm(1:100))
# use locator() when doing this for real; I'm going to use a saved set of points
#l <- locator()
l <- list(x = c(-2.22, -1.82, -1.26, -0.79, -0.57, -0.25, 0.18, 0.75,
                0.72, 1.26),
          y = c(1.46, 7.81, 3.79, 9.08, 17.11, 11.61, 15,
                17.96, 5.9, 3.37))
# for debugging purposes - the nth value of the output should match where
# the n value is shown on the histogram
text(l, labels=1:10, cex=0.7, font=2)
fi <- findInterval(l$x, h$breaks)
sel <- (l$y > 0) & (l$y < h$counts[fi])
replace(h$counts[fi], !sel, NA)
#[1] 3 NA 9 14 NA 22 20 NA 13 7
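The same logic can be wrapped in a small helper for reuse (the function name identify_hist is my own; this is a sketch, not a polished API):

```r
# Click inside bars on an existing histogram; finish with Esc / right-click.
# Returns the count of each clicked bar, or NA for clicks outside a bar.
identify_hist <- function(h) {
  l <- locator()
  fi <- findInterval(l$x, h$breaks)
  sel <- (l$y > 0) & (l$y < h$counts[fi])
  replace(h$counts[fi], !sel, NA)
}
# h <- hist(rnorm(100)); identify_hist(h)
```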


R - Categorize a dataset

Morning folks,
I'm trying to categorise a set of numerical values (Days Left divided by 365.2, which gives approximately the number of years left until maturity).
The result of this first calculation is a vector of 3560 values (example: 0.81, 1.65, 3.26 [...], 0.2).
I'd like to categorise these results into intervals: [between 0 and 1 year, 0 and 2 years, 0 and 3 years, 0 and 4 years, over 4 years].
#Set the Data Frame
dfMaturity <- data.frame(Maturity = DATA$Maturity)
#Call the library and run the function
library(plyr)
MaturityX = ddply(df, .(Maturity), nrow)
#Set the Data Frame
dfMaturityID <- data.frame(testttto = DATA$Security.Name)
#Calculation of the remaining days
MaturityID = ddply(df, .(dfMaturityID$testttto), nrow)
survey <- data.frame(date=c(DATA$Maturity),tx_start=c("1/1/2022"))
survey$date_diff <- as.Date(as.character(survey$date), format="%m/%d/%Y")-
as.Date(as.character(survey$tx_start), format="%m/%d/%Y")
# Data for the table
MaturityName <- MaturityID$`dfMaturityID$testttto`
MaturityZ <- survey$date
TimeToMaturity <- as.numeric(survey$date_diff)
# /!/ HERE IS WHERE I NEED HELP /!/ I'M TRYING TO CATEGORISE THE RESULTS OF THIS CALCULATION
Multiplier <- TimeToMaturity /365.2
cx <- cut(Multiplier, breaks=0:5)
The original data source comes from an Excel file (DATA$Maturity).
If it helps, print(Multiplier) gives:
[1] 0.4956188 1.4950712 1.9989047 0.2464403 0.9994524 3.0010953 5.0000000 7.0016429 9.0005476
[10] 21.0021906 4.1621030 13.1626506 1.1610077 8.6664841 28.5377875 3.1626506 6.7497262 2.0920044
[19] 2.5602410 4.6495071 0.3368018 6.3225630 8.7130340 10.4956188 3.9019715 12.7957284 5.8378970
I copied the first three lines, but there are 3560 values in total.
I'm open to any kind of help, I just want it to work :) thank you!
The cut function does that:
example <- c(0.81, 1.65, 3.26, 0.2)
cut(example, breaks = c(0, 1, 2, 3, 4),
labels = c("newborn", "one year old", "two", "three"))
Edit:
From the comment
I'd like then to create a table with for example: 30% of the objects has a maturity between 0 and 1 year
You could compute that using the function below:
example <- c(0.81, 1.65, 3.26, 0.2)
share <- function(x, lower = 0, higher = 1){
  x <- na.omit(x)
  sum((lower <= x) & (x < higher)) / length(x)
}
share(1:10, lower = 0, higher = 3.5)   # true for 1:3 out of 1:10, so 30%
share(1:10, lower = 4.5, higher = 5.5) # true for 5 only, so 10%
share(example, 0, 3)
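If you then want the share of every band at once rather than one band at a time, cut() combines with prop.table(); a sketch (the break at Inf and the band labels are my own choices):

```r
example <- c(0.81, 1.65, 3.26, 0.2)
cx <- cut(example, breaks = c(0, 1, 2, 3, 4, Inf),
          labels = c("0-1y", "1-2y", "2-3y", "3-4y", "over 4y"))
prop.table(table(cx))  # 0-1y: 0.50, 1-2y: 0.25, 2-3y: 0.00, 3-4y: 0.25, over 4y: 0.00
```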

Find indices of slope changes in a vector in R

I have a data frame with two columns: (1) datetimes and (2) streamflow values. I would like to create a 3rd column with indicator values to find sudden increases (usually a 0 but it is a 1 when the streamflow shows a big increase).
datetime <- as.POSIXct(c(1557439200, 1557440100, 1557441000, 1557441900, 1557442800,
                         1557443700, 1557444600, 1557445500, 1557446400, 1557447300,
                         1557448200, 1557449100, 1557450000, 1557450900, 1557451800,
                         1557452700, 1557453600, 1557454500, 1557455400, 1557456300,
                         1557457200, 1557458100, 1557459000), origin = "1970-01-01")
streamflow <- c(0.35, 0.35, 0.36, 0.54, 1.0, 2.7, 8.4, 9.3, 6.2, 3.8, 4.7,
                2.91, 2.01, 1.65, 1.41, 1.12, 0.95, 0.62, 0.52, 0.53, 0.53, 0.44, 0.35)
library(data.table)
data <- data.table(as.POSIXct(datetime), as.numeric(streamflow))
I am trying to create a function that would identify the datetime of where it jumps from 0.5 to 1 because that is when the event starts. It would then stop indicating it is an event when the streamflow goes below a certain threshold.
My current idea is a function that compares the local slope between two consecutive points in streamflow to a slope of all the values of streamflow within some window, but I don't really know how to write that. Or maybe there is a better idea for how to do what I am trying to do
data[, delta := (V2 - shift(V2)) / shift(V2)]  # shift() is data.table's lag
data[, ind_jump := delta > 0.5]
indices <- data[ind_jump == TRUE, V1]
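The question also asks for the flag to switch off once the flow drops back below a threshold; a sketch of that stateful version in plain R (the 0.5 end-of-event threshold is an assumed value):

```r
# An event starts at a sudden jump (relative increase > 0.5) and stays
# active until streamflow falls below the chosen end threshold.
flag_events <- function(flow, jump = 0.5, end_threshold = 0.5) {
  active <- FALSE
  out <- integer(length(flow))
  for (i in seq_along(flow)) {
    d <- if (i == 1) 0 else (flow[i] - flow[i - 1]) / flow[i - 1]
    if (!active && d > jump) active <- TRUE
    if (active && flow[i] < end_threshold) active <- FALSE
    out[i] <- as.integer(active)
  }
  out
}
flag_events(streamflow)
```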
Not related to this, but for some weird reason R gives
(0.54 - 0.36)/0.36 > 0.5
[1] TRUE
while
0.18/0.36 > 0.5
[1] FALSE
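That oddity is ordinary binary floating point, not R-specific: none of 0.54, 0.36, or 0.18 is exactly representable as a double, so the subtraction leaves a tiny excess. A sketch of how to see it, and one way to make threshold comparisons tolerant:

```r
print(0.54 - 0.36, digits = 17)  # slightly more than 0.18
print(0.18, digits = 17)         # slightly less than 0.18
# Compare with a tolerance instead of exact >:
isTRUE(all.equal((0.54 - 0.36) / 0.36, 0.5))  # equal up to tolerance
```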

How to compute land cover area in R

Basically, I computed a global distribution probability model in ASCII form, say gdpm; gdpm's values are all between 0 and 1.
Then I imported a local map from shape file:
library(rgdal)   # readOGR
library(tools)   # file_path_sans_ext
shape <- file.choose()
map <- readOGR(shape, basename(file_path_sans_ext(shape)))
Next, I rasterized gdpm and cropped it using the local map:
ldpm <- mask(gdpm, map)
Then, I reclassified this continuous model into a discrete model (I divided the model into 6 levels):
recalc <- matrix(c(0,    0.05, 0,
                   0.05, 0.2,  1,
                   0.2,  0.4,  2,
                   0.4,  0.6,  3,
                   0.6,  0.8,  4,
                   0.8,  1,    5), ncol = 3, byrow = TRUE)
ldpmR <- reclassify(ldpm, recalc)
I've got a cropped and reclassified raster; now I need to summarize land cover, that is, for each level I want to calculate its proportion of area within each region of the local map (I don't know the terminology for this). I found and followed an example (RobertH):
ext <- raster::extract(ldpmR, map)
tab <- sapply(ext, function(x) tabulate(x, 10))
tab <- tab / colSums(tab)
But I'm not sure if it works, since the output of tab is confusing.
So how to compute land cover area correctly? How can I apply the correct method within each polygon?
My original data is too large, I can only provide an alternative raster (I think this example should apply a different reclassify matrix):
Example raster
Or you can generate a test raster (RobertH):
library(raster)
s <- stack(system.file("external/rlogo.grd", package="raster"))
writeRaster(s, file='testtif', format='GTiff', bylayer=T, overwrite=T)
f <- list.files(pattern="testtif_..tif")
I also have a question about plotting a raster:
r <- as(r, "SpatialPixelsDataFrame")
r <- as.data.frame(r)
colnames(r) <- c("value", "x", "y")
I do this conversion to make a raster plot-able with ggplot2, is there a more concise method?
loki's answer is OK, but this can be done the raster way, which is safer. It is also important to consider whether the coordinates are angular (longitude/latitude) or planar (projected).
Example data
library(raster)
r <- raster(system.file("external/test.grd", package="raster"))
r <- r / 1000
recalc <- matrix(c(0,    0.05, 0,
                   0.05, 0.2,  1,
                   0.2,  0.4,  2,
                   0.4,  0.6,  3,
                   0.6,  0.8,  4,
                   0.8,  2,    5), ncol = 3, byrow = TRUE)
r2 <- reclassify(r, recalc)
Approach 1. Only for planar data
f <- freq(r2, useNA='no')
apc <- prod(res(r))
f <- cbind(f, area=f[,2] * apc)
f
# value count area
#[1,] 1 78 124800
#[2,] 2 1750 2800000
#[3,] 3 819 1310400
#[4,] 4 304 486400
#[5,] 5 152 243200
Approach 2. For angular data (but also works for planar data)
a <- area(r2)
z <- zonal(a, r2, 'sum')
z
# zone sum
#[1,] 1 124800
#[2,] 2 2800000
#[3,] 3 1310400
#[4,] 4 486400
#[5,] 5 243200
If you want to summarize by polygons, you can do something like this:
# example polygons
a <- rasterToPolygons(aggregate(r, 25))
Approach 1
# extract values (slow)
ext <- extract(r2, a)
# tabulate values for each polygon
tab <- sapply(ext, function(x) tabulate(x, 5))
# adjust for area (planar data only)
tab <- tab * prod(res(r2))
# check the results, by summing over the regions
rowSums(tab)
#[1] 124800 2800000 1310400 486400 243200
Approach 2
x <- rasterize(a, r2)
z <- crosstab(x, r2)
z <- cbind(z, area = z[,3] * prod(res(r2)))
Check results:
aggregate(z[, 'area', drop=F], z[,'Var2', drop=F], sum)
#  Var2    area
#1 1 124800
#2 2 2800000
#3 3 1310400
#4 4 486400
#5 5 243200
Note that if you are dealing with lon/lat data you cannot use prod(res(r)) to get the cell size. In that case you will need to use the area function and loop over classes, I think.
You also asked about plotting. There are many ways to plot a Raster* object. The basic ones are:
image(r2)
plot(r2)
spplot(r2)
library(rasterVis)
levelplot(r2)
More tricky approaches:
library(ggplot2) # using a rasterVis method
theme_set(theme_bw())
gplot(r2) + geom_tile(aes(fill = value)) +
  facet_wrap(~ variable) +
  scale_fill_gradient(low = 'white', high = 'blue') +
  coord_equal()
library(leaflet)
leaflet() %>% addTiles() %>%
addRasterImage(r2, colors = "Spectral", opacity = 0.8)
Seems like you can get the area by the number of pixels.
Let's start with a reproducible example:
r <- raster(system.file("external/test.grd", package="raster"))
plot(r)
Since the values in this raster are in a different range than your data, let's adapt them to your values:
r <- r / 1000
r[r > 1] <- 1
Afterwards, we apply your reclassification:
recalc <- matrix(c(0, 0.05, 0, 0.05, 0.2, 1, 0.2, 0.4, 2, 0.4, 0.6, 3, 0.6, 0.8, 4, 0.8, 1, 5), ncol = 3, byrow = TRUE)
r2 <- reclassify(r, recalc)
plot(r2)
Now, how do we get the area?
Since you are working with a projected raster, you can simply use the number of pixels and the raster resolution. Therefore, we first need to check the resolution and the map units of the projection:
res(r)
# [1] 40 40
crs(r)
# CRS arguments:
# +init=epsg:28992
# +towgs84=565.237,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +proj=sterea
# +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000
# +y_0=463000 +ellps=bessel +units=m +no_defs
Now we know that we are dealing with pixels of 40 x 40 meters, since we have a metric CRS.
Let's use this information to calculate the area of each class.
app <- res(r)[1] * res(r)[2] # area per pixel
table(r2[]) * app
# 1 2 3 4 5
# 124800 2800000 1310400 486400 243200
For the plotting of georeferenced rasters, I would like to refer you to an older question here on SO

Plotting Quantiles values of boxplot in R inside a for loop

Suppose I have the data frame airquality. I made a for loop to plot all the boxplots of the airquality data set.
name <- names(airquality)
classes <- sapply(airquality, class)
airquality[is.na(airquality)] <- 0
for (name in name[classes == 'numeric']) {
  boxplot(airquality[, name])
}
Now I want to display all the quantile values, i.e. the first quartile, median, third quartile, and the mean, as in the image below. I searched the web a lot but didn't find anything that suits my need. Below is the desired graph I want to plot:
Here is an example, just using the "Wind" attribute.
B = boxplot(airquality[,"Wind"])
text(1.3, B$stats, B$stats)
IQR = B$stats[4] - B$stats[2]
segments(0.5, c(B$stats[2], B$stats[4]), 0.7, c(B$stats[2], B$stats[4]))
text(0.6, B$stats[3], IQR)
arrows(0.6, B$stats[3] + 0.5, 0.6, B$stats[4] - 0.1, length = 0.1)
arrows(0.6, B$stats[3] - 0.5, 0.6, B$stats[2] + 0.1, length = 0.1)
With your code:
name <- names(airquality)
classes <- sapply(airquality, class)
airquality[is.na(airquality)] <- 0
for (name in name[classes == 'numeric']) {
  boxplot(airquality[, name])
  text(x = 1.25, y = fivenum(airquality[, name]), labels = fivenum(airquality[, name]))
  text(x = 0.75, y = median(airquality[, name]), labels = IQR(airquality[, name]))
  arrows(0.77, fivenum(airquality[, name])[2], 0.77, fivenum(airquality[, name])[4],
         angle = 90, length = 0.07, code = 3)
}
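The question also asks for the mean, which fivenum() does not include; a minimal addition to the same loop (a sketch: the marker position, colour, and rounding are my own choices, and the classes vector from above is assumed):

```r
for (name in names(airquality)[classes == 'numeric']) {
  boxplot(airquality[, name])
  m <- mean(airquality[, name])
  points(1, m, pch = 18, col = "red")                # mark the mean
  text(1.25, m, round(m, 2), col = "red", font = 2)  # label it
}
```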
The Plot is here, in this link: Boxplot with IQR rule

R Generic solution to create 2*2 confusion matrix

My question is related to this one on producing a confusion matrix in R with the table() function. I am looking for a solution without using a package (e.g. caret).
Let's say these are our predictions and labels in a binary classification problem:
predictions <- c(0.61, 0.36, 0.43, 0.14, 0.38, 0.24, 0.97, 0.89, 0.78, 0.86, 0.15, 0.52, 0.74, 0.24)
labels <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0)
For these values, the solution below works well to create a 2*2 confusion matrix for, let's say, threshold = 0.5:
# Confusion matrix for threshold = 0.5
conf_matrix <- as.matrix(table(predictions > 0.5, labels))
conf_matrix
#        labels
#         0 1
#   FALSE 4 3
#   TRUE  2 5
However, I do not get a 2*2 matrix if I select any value smaller than min(predictions) or larger than max(predictions), since the data then won't have either a FALSE or a TRUE occurrence, e.g.:
conf_matrix <- as.matrix(table(predictions > 0.05, labels))
conf_matrix
#       labels
#        0 1
#   TRUE 6 8
I need a method that consistently produces a 2*2 confusion matrix for all possible thresholds (decision boundaries) between 0 and 1, as I use this as an input in an optimisation. Is there a way I can tweak the table function so it always returns a 2*2 matrix here?
You can make your thresholded prediction a factor variable to achieve this:
(conf_matrix <- as.matrix(table(factor(predictions>0.05, levels=c(F, T)), labels)))
# labels
# 0 1
# FALSE 0 0
# TRUE 6 8
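To guard the other dimension as well (a labels vector that happens to miss one class), both margins can be forced to factors; a sketch wrapped in a helper (the function name confusion_2x2 is my own):

```r
confusion_2x2 <- function(predictions, labels, threshold) {
  as.matrix(table(factor(predictions > threshold, levels = c(FALSE, TRUE)),
                  factor(labels, levels = c(0, 1))))
}
confusion_2x2(predictions, labels, 0.05)  # always 2 x 2, for any threshold
```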
