Spatial clustering/sampling in R

I have a spatial data frame in R. We have a class imbalance problem: the response variable is binary and the positive class is only about 10% of the dataset. I want to set the positive cases aside and then select a portion of the negative cases to combat the class imbalance in the model. I want to select negative cases that are closely related spatially, and I am really struggling to figure out how.
Some ideas I have thought about which may work:
KNN to cluster negative cases
Overlay spatial grid and extract x number of samples from each grid square
Buffer analysis and randomly select within buffer
If anyone has recommendations on how to execute this in R it would be awesome.
Thanks!

Just going to answer here in case anyone else searches this.
I decided to go with a k-means cluster, added the cluster assignment as a column to the data frame, and randomly sampled from the clusters.
Code below!
## Cluster analysis
set.seed(1)
clusdb <- W_neg[c("x_coor_farm", "y_coor_farm", "Area_Farm_SqM", "NatGrass_1km_buff",
                  "BioFor_1km_buff", "MixedFor_1km_buff", "Area_Cut_012", "Area_Cut_1224",
                  "Area_Cut_2436", "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436")]
## Write a function to loop the algorithm over k
kmean_withinss <- function(k) {
  cluster <- kmeans(clusdb, k)
  return(cluster$tot.withinss)
}
# Set maximum number of clusters
max_k <- 20
# Run the algorithm over a range of k
wss <- sapply(2:max_k, kmean_withinss)
# Data frame of k-means output to find the optimal k
elbow <- data.frame(2:max_k, wss)
# Plot the elbow curve
library(ggplot2)
ggplot(elbow, aes(x = X2.max_k, y = wss)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = seq(1, 20, by = 1))
# Optimal k = 8; re-run the model with k = 8
pc_cluster_2 <- kmeans(clusdb, 8)
pc_cluster_2$cluster
pc_cluster_2$centers
pc_cluster_2$size
pc_cluster_2$totss
pc_cluster_2$betweenss
pc_cluster_2$betweenss / pc_cluster_2$totss * 100
# 92%
# Add the cluster assignment as a column to the data frame
W_neg$cluster <- pc_cluster_2$cluster
W_neg <- W_neg[c("TB2017", "x_coor_farm", "y_coor_farm", "Area_Farm_SqM", "NatGrass_1km_buff",
                 "BioFor_1km_buff", "MixedFor_1km_buff", "Area_Cut_012", "Area_Cut_1224",
                 "Area_Cut_2436", "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436", "cluster")]
ggplot(data = W_neg, aes(y = cluster)) +
  geom_bar(aes(fill = TB2017)) +
  ggtitle("Count of Clusters by Region") +
  theme(plot.title = element_text(hjust = 0.5))
library(factoextra)  # needed for fviz_cluster()
fviz_cluster(pc_cluster_2, data = scale(clusdb), geom = c("point"), ellipse.type = "euclid")
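The random sampling from the clusters isn't shown above; here is a minimal sketch with dplyr. The target of roughly 100 negatives per cluster and the data frame W_pos holding the positive cases are assumptions for illustration, not values from the original post.
library(dplyr)
set.seed(1)
# Draw up to 100 negative cases from each k-means cluster;
# slice_sample() silently returns the whole cluster if it has fewer rows
W_neg_sampled <- W_neg %>%
  group_by(cluster) %>%
  slice_sample(n = 100) %>%
  ungroup()
# Recombine with the positive cases (assumed to be in W_pos) before modelling
model_data <- bind_rows(W_pos, W_neg_sampled)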


Export results from LOESS plot

I am trying to export the underlying data from a LOESS plot (the blue line).
I found this post on the subject and was able to get it to export like the post says:
Can I export the result from a loess regression out of R?
However, as the last comment from the poster in that post says, I am not getting the results for my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
#loess object
CL111_loess <- loess(dur_cleaned~TS_LightOn, data = CL111)
#get SE
CL111_predict <- predict(CL111_loess, se=T)
CL111_output <- data.frame("fitted" = CL111_predict$fit, "SE" = CL111_predict$se.fit)
write.csv(CL111_output, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
#individual plot
ggplot(data = CL111) +
  geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "lm", se = FALSE, colour = "Green") +
  labs(x = "TS Light On (Seconds)", y = "TS Response Time (Seconds)", title = "Layout 1, Condition AO, INS High") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "loess", se = TRUE) +
  xlim(0, 400) + ylim(0, 1.0)
#find coefficients for best fit line
lm(CL111_LM$dur_cleaned ~ CL111_LM$TS_LightOn)
You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then examine the data object (a list with one data frame per layer) and work out which layer you need (in this case, I looked for the data layer whose colour column matched the smooth line).
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird/that I don't understand about the way that ggplot2 is setting up the loess fit. You do have to do predict() with the right newdata (e.g. a data frame with a single column TS_LightOn that ranges from 0 to 400) - otherwise you get predictions of the points in your data set, which may not be properly spaced/in the right order - but that doesn't resolve the difference for me.
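For reference, a minimal sketch of that predict() call with an evenly spaced newdata grid over 0-400 (the grid size of 80 points is an arbitrary choice):
# evenly spaced prediction grid; loess returns NA outside the range of the training data
new_x <- data.frame(TS_LightOn = seq(0, 400, length.out = 80))
CL111_grid_pred <- predict(CL111_loess, newdata = new_x, se = TRUE)
CL111_grid_out <- data.frame(TS_LightOn = new_x$TS_LightOn,
                             fitted = CL111_grid_pred$fit,
                             SE = CL111_grid_pred$se.fit)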
To complement @ben-bolker's answer, I have just written a small function that may be useful for retrieving the internal dataset created by ggplot for a geom_smooth call. It takes the resulting ggplot as input and returns the smoothed data. The problem it solves is that, as Ben observed, ggplot internally creates a smoothed fit with predicted data on its own interval, different from the interval of the input data. This function gets you back the ggplot fit data on an interval of integer, equally spaced values. It uses a loess fit on the already smoothed data, with a small span (0.1) that is adjusted upward on the fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method that is not 'loess' or using 'NULL' and you cannot easily build a model that replicates what geom_smooth is doing internally.
The function separates different series on the same plot as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that this function uses an interval based on integer values of x. You can modify this if you need an interval based on equally-spaced values of x, but not integral. In that case, pass your x interval of choice in the xInterval parameter, or tweak the line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x)))).
get_geom_smooth_dataFromPlot <- function(a_ggplot, xInterval = NULL) {
  # internal ggplot values read into ggTable
  ggTable <- ggplot_build(a_ggplot)$data[[1]]
  # facet panels
  panels <- as.numeric(names(table(ggTable$PANEL)))
  nPanel <- length(panels)
  onePanel <- (nPanel == 1)
  # number of series in each plot
  groups <- as.numeric(names(table(ggTable$group)))
  nGroup <- length(groups)
  oneGroup <- (nGroup == 1)
  out <- data.frame()
  # are there 'ymin' and 'ymax' values?
  SE_data <- "ymin" %in% colnames(ggTable)
  for (pan in (1:nPanel)) {
    for (grp in (1:nGroup)) {
      sub <- subset(ggTable, (PANEL == panels[pan]) & (group == groups[grp]))
      # no group series for this facet panel?
      if (dim(sub)[1] == 0) next
      if (is.null(xInterval)) {
        outOne <- data.frame(x = c(min(trunc(sub$x)):max(trunc(sub$x))))
      } else {
        outOne <- data.frame(x = xInterval)
      }
      nObs <- dim(outOne)[1]
      # hack to avoid problems with a small range for the x interval:
      # when there are more than 90 x values we use a span of 0.1,
      # adjusted on-the-fly up to a span of 0.5 for 10 values of the x interval
      cSpan <- max(0.1, 0.5 * 10 / (nObs - (nObs - 10) / 2))
      if (!onePanel) outOne$panel <- pan
      if (!oneGroup) outOne$group <- grp
      mod <- loess(y ~ x, data = sub, span = cSpan)
      outOne$y <- predict(mod, outOne$x, se = FALSE)
      if (SE_data) {
        mod <- loess(ymin ~ x, data = sub, span = cSpan)
        outOne$ymin <- predict(mod, outOne$x, se = FALSE)
        mod <- loess(ymax ~ x, data = sub, span = cSpan)
        outOne$ymax <- predict(mod, outOne$x, se = FALSE)
      }
      out <- rbind(out, outOne)
    }
  }
  return(out)
}
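A minimal usage sketch: it assumes the geom_smooth() layer is the first layer of a plot stored in gg1, since the function reads ggplot_build(...)$data[[1]].
# Rebuild the smoothed data underlying the first layer of gg1 and export it
smooth_dat <- get_geom_smooth_dataFromPlot(gg1)
head(smooth_dat)
write.csv(smooth_dat, "smoothed_fit.csv", row.names = FALSE)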

A general solution to analyze and plot two data frames with varying lengths?

Could you please help me?
I'm writing code in R to automate a null model analysis of multiple networks. First, the code reads multiple TXT matrices into R. Second, it calculates a topological metric for each network. Third, it randomizes each network N times using a null model. Fourth, it calculates the same topological metric for all randomized versions of the original matrices.
In the fifth and final step, the idea is to compare the observed scores against the distributions of randomized scores. First, by doing a simple count of how many randomized scores are above or below the observed score, in order to estimate the P-values. Second, by plotting the distribution of randomized scores as a density and adding a vertical line to show the observed score.
Here are examples of the data frames that need to be analyzed:
networks <- paste("network", rep(1:3), sep = "")
randomizations <- seq(1:10)
observed.ex <- data.frame(network = networks,
observed = runif(3, min = 0, max = 1))
randomized.ex <- data.frame(network = sort(rep(networks, 10)),
randomization = rep(randomizations, 3),
randomized = rnorm(length(networks)*
length(randomizations),
mean = 0.5, sd = 0.1))
In the first step of the final analysis, the code estimates the P-values by doing simple counts. As you see, I need to make copies of the calculation call for each network:
randomized.network1 <- subset(randomized.ex, network == "network1")
sum(randomized.network1$randomized >= observed.ex$observed[1]) /
length(randomized.network1$randomized)
sum(randomized.network1$randomized <= observed.ex$observed[1]) /
length(randomized.network1$randomized)
randomized.network2 <- subset(randomized.ex, network == "network2")
sum(randomized.network2$randomized >= observed.ex$observed[2]) /
length(randomized.network2$randomized)
sum(randomized.network2$randomized <= observed.ex$observed[2]) /
length(randomized.network2$randomized)
randomized.network3 <- subset(randomized.ex, network == "network3")
sum(randomized.network3$randomized >= observed.ex$observed[3]) /
length(randomized.network3$randomized)
sum(randomized.network3$randomized <= observed.ex$observed[3]) /
length(randomized.network3$randomized)
In the second step of the final analysis, the code makes density plots. As you see, I need to make copies of the vertical line call for each network:
ggplot(randomized.ex, aes(randomized)) +
geom_density() +
facet_grid(network~.) +
geom_vline(data=filter(randomized.ex, network == "network1"),
aes(xintercept = observed.ex$observed[1]), colour = "red") +
geom_vline(data=filter(randomized.ex, network == "network2"),
aes(xintercept = observed.ex$observed[2]), colour = "red") +
geom_vline(data=filter(randomized.ex, network == "network3"),
aes(xintercept = observed.ex$observed[3]), colour = "red")
Is there a way to make this final analysis more general, so it always does the same calculations and plots, no matter how many networks are read in the beginning?
Thank you very much!
It looks like this can be neatly wrapped in an lapply loop that iterates over each file. How does the below work for you? You could also pass in filenames rather than the number of files (currently 1:3) and have the first line "read" in your TXT matrices.
library(dplyr) #For %>%, group_by, and summarize
output <- lapply(1:3, function(network_num){
network <- paste0("network", network_num)
n_randomizations <- 10
observed.ex <- runif(1)
randomized.ex <- rnorm(n_randomizations, mean = 0.5, sd = 0.1)
return(data.frame(network=network, observed=observed.ex, randomized=randomized.ex))
}) %>% do.call(what = rbind)
output %>%
group_by(network) %>%
summarize(p_value=mean(observed>=randomized))
ggplot(output) +
geom_density(aes(randomized)) +
facet_grid(network~.) +
geom_vline(aes(xintercept = observed), col="red")
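If you already have the observed.ex and randomized.ex data frames from the question, the same generalization can be written as a join plus a grouped summary. This variant is a sketch of mine, not part of the answer above:
library(dplyr)
library(ggplot2)
# p-values: one row per network, however many networks there are
p_values <- randomized.ex %>%
  left_join(observed.ex, by = "network") %>%
  group_by(network) %>%
  summarize(p_above = mean(randomized >= observed),
            p_below = mean(randomized <= observed))
# density plots with one observed line per facet
ggplot(randomized.ex, aes(randomized)) +
  geom_density() +
  facet_grid(network ~ .) +
  geom_vline(data = observed.ex, aes(xintercept = observed), colour = "red")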

Extract critical points of a polynomial model object in R?

I am trying to solve for the critical points of a cubic polynomial function which has been fitted to data, i.e. the values of x where the first derivative is zero.
I also need a way to find the values of y at the critical points of x.
It is easy enough to fit the model using lm() and to view the model quality with summary(). And I can plot the function easily enough by adding predictions and using geom_line().
There must be a package or a base R function dedicated to this problem. Can anyone suggest a method?
Below is a reprex to depict the problem. Needless to say, the arrows are drawn only to illustrate the question; they are not mapped to the true critical points or I would not be asking this question...
library(tidyverse)
library(modelr)
set.seed(0)
#generate random data and plot the values
df <- tibble(x= sample(x= c(-100:200), size= 50),
y= -0.5*(x^3) + 50*(x^2) + 7*(x) + rnorm(n=50, mean=10000, sd=50000) )
df %>% ggplot(aes(x, y)) +
geom_point()
# fit a model to the data
cubic_poly_model <- lm(data= df, formula = y~poly(x, 3))
# plot the fitted model
df %>%
add_predictions(model = cubic_poly_model) %>%
ggplot(aes(x, y))+
geom_point(alpha=1/3)+
geom_line(aes(x, y=pred))+
annotate('text', label= 'critical point A', x=-50, y=-250000)+
geom_segment(x=-50, xend=-10, y=-200000, yend=-5000, arrow = arrow(length=unit(3, 'mm'), type = 'closed'))+
annotate('text', label= 'critical point B', x=140, y=400000)+
geom_segment(x=110, xend=90, y=300000, yend=100000, arrow = arrow(length=unit(3, 'mm'), type = 'closed'))
# But how can I get the critical values of x and the y values they produce?
Created on 2020-09-03 by the reprex package (v0.3.0)
I devised a solution using the mosaic package. The makeFun() function allows a model object to be converted to a function. You can then use base R optimize() to find the maximum or minimum value of that function over a specified interval (in this case, the range of x values). Specify the "maximum" argument in optimize() to state whether you want the local maximum or the local minimum.
See code below:
library(magrittr)
set.seed(0)
#generate random data and plot the values
df <- tibble::tibble(x= sample(x= c(-100:200), size= 50),
y= -0.5*(x^3) + 50*(x^2) + 7*(x) + rnorm(n=50, mean=10000, sd=50000) )
cubic_poly_model <- lm(data= df, formula = y~poly(x, 3))
crit_values <- cubic_poly_model %>%
mosaic::makeFun() %>%
optimize(interval = c(min(df$x), max(df$x)), maximum = TRUE)
funct_crit_x <- crit_values[['maximum']][[1]]
funct_max <- crit_values[['objective']]
funct_crit_x
funct_max
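The call above only returns the local maximum. A sketch of the complementary call for the local minimum uses the same makeFun()/optimize() route with maximum = FALSE; note that over the full x range the minimizer may land on an interval endpoint rather than an interior critical point.
crit_values_min <- cubic_poly_model %>%
  mosaic::makeFun() %>%
  optimize(interval = c(min(df$x), max(df$x)), maximum = FALSE)
funct_crit_x_min <- crit_values_min[['minimum']][[1]]
funct_min <- crit_values_min[['objective']]
funct_crit_x_min
funct_min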

Trying to Plot Standard Normal Distribution and t-distribution on same graph

dat <- data.frame(dens = c(rnorm(1000000), rt(1000000, 4)), lines = rep(c("a", "b"), each = 100000))
ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5)
This is my code. I'm trying to plot the two distributions on the same graph. I only ended up with the t distribution.
Any feedback would be appreciated. Thank you.
As one of the comments says, this is basically a typo: you repeat the labels a and b 100,000 (a hundred thousand) times each, while each distribution contributes 1,000,000 values, so the normally distributed and t-distributed numbers get mixed between the two labels. You need to set the argument each = 1000000 (a million). Or, to avoid confusion over how many zeros to type, just use 1e6, which means 1 * 10^6.
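A minimal corrected version of the original call (keeping the original sample sizes):
library(ggplot2)
dat <- data.frame(dens = c(rnorm(1e6), rt(1e6, 4)),
                  lines = rep(c("a", "b"), each = 1e6))
ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5)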

Average values of a point dataset to a grid dataset

I am relatively new to ggplot, so please forgive me if some of my problems are really simple or not solvable at all.
What I am trying to do is generate a "heat map" of a country where the filling of the shape is continuous. Furthermore, I have the shape of the country as .RData. I used Hadley Wickham's script to transform my SpatialPolygon data into a data frame. The long and lat data of my data frame now look like this:
head(my_df)
long lat group
6.527187 51.87055 0.1
6.531768 51.87206 0.1
6.541202 51.87656 0.1
6.553331 51.88271 0.1
This long/lat data draws the outline of Germany. The rest of the data frame is omitted here since I think it is not needed. I also have a second data frame of values for certain long/lat points, which looks like this:
my_fixed_points
long lat value
12.817 48.917 0.04
8.533 52.017 0.034
8.683 50.117 0.02
7.217 49.483 0.0542
What I would like to do now is colour each point of the map according to the average value over all the fixed points that lie within a certain distance of that point. That way I would get an (almost) continuous colouring of the whole map of the country.
What I have so far is the map of the country plotted with ggplot2
ggplot(my_df,aes(long,lat)) + geom_polygon(aes(group=group), fill="white") +
geom_path(color="white",aes(group=group)) + coord_equal()
My first idea was to generate points that lie within the map that has been drawn, and then calculate the value for every generated point my_generated_point like so:
value_vector <- subset(my_fixed_points,
spDistsN1(cbind(my_fixed_points$long, my_fixed_points$lat),
c(my_generated_point$long, my_generated_point$lat), longlat=TRUE) < 50,
select = value)
point_value <- mean(value_vector)
I haven't found a way to generate these points though. And as with the whole problem, I don't even know if it is possible to solve it this way. My question now is whether there is a way to generate these points and/or whether there is another way to come to a solution.
Solution
Thanks to Paul I almost got what I wanted. Here is an example with sample data for the Netherlands.
library(ggplot2)
library(sp)
library(automap)
library(rgdal)
library(scales)
#get the spatial data for the Netherlands
con <- url("http://gadm.org/data/rda/NLD_adm0.RData")
print(load(con))
close(con)
#transform them into the right format for autoKrige
gadm_t <- spTransform(gadm, CRS=CRS("+proj=merc +ellps=WGS84"))
#generate some random values that serve as fixed points
value_points <- spsample(gadm_t, type="stratified", n = 200)
values <- data.frame(value = rnorm(dim(coordinates(value_points))[1], 0 ,1))
value_df <- SpatialPointsDataFrame(value_points, values)
#generate a grid that can be estimated from the fixed points
grd = spsample(gadm_t, type = "regular", n = 4000)
kr <- autoKrige(value~1, value_df, grd)
dat = as.data.frame(kr$krige_output)
#draw the generated grid with the underlying map
ggplot(gadm_t,aes(long,lat)) + geom_polygon(aes(group=group), fill="white") + geom_path(color="white",aes(group=group)) + coord_equal() +
geom_tile(aes(x = x1, y = x2, fill = var1.pred), data = dat) + scale_fill_continuous(low = "white", high = muted("orange"), name = "value")
I think what you want is something along these lines. I predict that this homebrew is going to be terribly inefficient for large datasets, but it works on a small example dataset. I would look into kernel densities and maybe the raster package. But maybe this suits you well...
The following snippet of code calculates the mean value of cadmium concentration of a grid of points overlaying the original point dataset. Only points closer than 1000 m are considered.
library(sp)
library(ggplot2)
library(scales)  # for muted()
loadMeuse()      # loads the meuse point data and the meuse.grid prediction grid as Spatial objects
# Generate a grid to sample on
bb = bbox(meuse)
grd = spsample(meuse, type = "regular", n = 4000)
# Come up with mean cadmium value
# of all points < 1000m.
mn_value = sapply(1:length(grd), function(pt) {
d = spDistsN1(meuse, grd[pt,])
return(mean(meuse[d < 1000,]$cadmium))
})
# Make a new object
dat = data.frame(coordinates(grd), mn_value)
ggplot(aes(x = x1, y = x2, fill = mn_value), data = dat) +
geom_tile() +
scale_fill_continuous(low = "white", high = muted("blue")) +
coord_equal()
which leads to the following image:
An alternative approach is to use an interpolation algorithm. One example is kriging. This is quite easy using the automap package (spot the self-promotion: I wrote the package):
library(automap)
kr = autoKrige(cadmium~1, meuse, meuse.grid)
dat = as.data.frame(kr$krige_output)
ggplot(aes(x = x, y = y, fill = var1.pred), data = dat) +
geom_tile() +
scale_fill_continuous(low = "white", high = muted("blue")) +
coord_equal()
which leads to the following image:
However, without knowledge as to what your goal is with this map, it is hard for me to see what you want exactly.
This slideshow offers another approach--see page 18 for a description of the approach and page 21 for a view of what the results looked like for the slide-maker.
Note however that the slide-maker used the sp package and the spplot function rather than ggplot2 and its plotting functions.
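For completeness, a minimal spplot sketch in that spirit, reusing the kriging result kr and its var1.pred prediction column from the automap code above (this is an illustration of the spplot route, not a transcription of the slides):
# spplot() works directly on the Spatial* object returned by autoKrige()
spplot(kr$krige_output, "var1.pred", main = "Kriged cadmium predictions")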
