R: Smoothing Data (Large Dataset - A For Loop is Too Slow)

I'm aware there are many questions related to smoothing data in R, but my knowledge is too basic to apply them to the following problem. My key issue is that my data has more than 1.7 million rows.
My Problem
I have a list "df" of 4 equal length vectors.
df[[1]] is a vector containing all uk postcodes
df[[2]] is a vector of latitudes
df[[3]] is a vector of longitudes
df[[4]] contains concentrations of a certain material
What I need to do is create a vector of 'smoothed' concentrations for each postcode, which should be calculated as: "A weighted average of concentrations in all postcodes within a given distance. The weighting is defined as exp(-Distance)"
I currently have the following code. It works perfectly (I've tested on a subset of 100k postcodes). However, it's far too slow, given the fact it loops over almost 2 million entries.
Can anyone help me find a faster way to do this?
df <- as.list(Import[,c("Postcode", "Latitude", "Longitude", "Concentration")])
n <- length(df[[1]])
Out <- rep(0,n)
for(i in 1:n){
  #Calculate squared Euclidean Distance
  BaseLat <- df[[2]][i]
  BaseLong <- df[[3]][i]
  Distance <- (df[[2]]-BaseLat)^2 + (df[[3]]-BaseLong)^2
  #Weightings
  Weight <- ifelse(Distance < 0.01, exp(-Distance), 0)
  #Take average rate and assign to output vector
  Out[i] <- sum(df[[4]]*Weight)/sum(Weight)
}
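One possible way to speed this up, offered as a sketch rather than something tested on the full 1.7m rows: because the weight is zero once the squared distance reaches 0.01, each postcode only needs to be compared with postcodes at most 0.1 degrees away. Binning the points into grid cells of side 0.1 means each point only has to look at its own cell and the 8 surrounding ones. The function name smooth_conc and the arguments max_d2 and chunk_size below are made up for illustration.

```r
# Hypothetical helper: grid-binned version of the loop above (base R only).
smooth_conc <- function(lat, lon, conc, max_d2 = 0.01, chunk_size = 500) {
  cell <- sqrt(max_d2)                 # cell side equals the cutoff distance
  cx <- floor(lat / cell)              # grid column of every point
  cy <- floor(lon / cell)              # grid row of every point
  cells <- split(seq_along(lat), paste(cx, cy))
  out <- numeric(length(lat))
  for (idx in cells) {
    # Candidate neighbours: points in this cell and the 8 adjacent cells
    nb <- as.vector(outer(cx[idx[1]] + -1:1, cy[idx[1]] + -1:1, paste))
    cand <- unlist(cells[nb], use.names = FALSE)
    # Work through the cell in chunks so the distance matrix stays small
    for (chunk in split(idx, ceiling(seq_along(idx) / chunk_size))) {
      d2 <- outer(lat[chunk], lat[cand], "-")^2 +
            outer(lon[chunk], lon[cand], "-")^2
      w <- ifelse(d2 < max_d2, exp(-d2), 0)
      out[chunk] <- as.vector(w %*% conc[cand]) / rowSums(w)
    }
  }
  out
}

# Small made-up example
set.seed(1)
lat <- runif(1000, 50, 51)
lon <- runif(1000, -2, -1)
conc <- runif(1000)
Out <- smooth_conc(lat, lon, conc)
```

With the list from the question, the call would be smooth_conc(df[[2]], df[[3]], df[[4]]). The loop now runs once per occupied grid cell instead of once per postcode, though very dense cells will still dominate the run time.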

Related

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I'm either getting an error, returning only 1 value (so it's computing the distance between the entire vector), or I end up with a matrix when all I need is a 4th column with the distance between each pair (columns 'P' and 'S.') I'm a bit confused by matrices so I'm not sure how to work with that result.
Tried making a function and applying it to the 2 columns but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum((P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
Using cbind() instead returns a matrix:
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pair of points) but for R instead of Python
Thanks for the help!
Try this if you just want to add another column to your dataframe:
testdf$distance <- sqrt(testdf$P^2 + testdf$S^2)
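On the matrix confusion in the question, a small sketch may help. dist() measures distances between the rows of whatever it is given, so rbind(P, S) (2 rows) yields a single distance while cbind(P, S) (10 rows) yields all 45 pairwise distances. The last line assumes, and this is only an assumption about what the data mean, that P and S are two one-dimensional values per ID, in which case the per-row distance needs no matrix at all.

```r
set.seed(1)
testdf <- data.frame(ID = sample(1:10, 10, replace = FALSE),
                     P  = runif(10, min = 1, max = 3),
                     S  = runif(10, min = 1, max = 3))

# Two rows (P and S as 10-dimensional points): one distance
d_two_rows <- dist(rbind(testdf$P, testdf$S))

# Ten rows (one 2-D point per ID): 45 pairwise distances
d_ten_rows <- dist(cbind(testdf$P, testdf$S))

# Full 10 x 10 matrix, labelled by ID, so dm["3", "7"] is the distance
# between the points belonging to IDs 3 and 7
dm <- as.matrix(d_ten_rows)
dimnames(dm) <- list(testdf$ID, testdf$ID)

# Assumption: P and S are scalars per ID, so their distance is |P - S|
testdf$abs_diff <- abs(testdf$P - testdf$S)
```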

How can I automate creation of a list of vectors containing simulated data from a known distribution, using a "for loop" in R?

First stack exchange post so please bear with me. I'm trying to automate the creation of a list, and the list will be made up of many empty vectors of various, known lengths. The empty vectors will then be filled with simulated data. How can I automate creation of this list using a for loop in R?
In this simplified example, fish have been caught by casting a net 4 times, and their abundance is given in the vector "abundance" (from counting the number of total fish in each net). We don't have individual fish weights, just the mean weight of all fish each net, so I need to simulate their weights from a lognormal distribution. So, I'm then looking to fill those empty vectors for each net, each with a length equal to the number of fish caught in that net, with weight data simulated from a lognormal distribution with a known mean and standard deviation.
A simplified example of my code:
abundance <- c(5, 10, 9, 20)
net1 <- rep(NA, abundance[1])
net2 <- rep(NA, abundance[2])
net3 <- rep(NA, abundance[3])
net4 <- rep(NA, abundance[4])
simulated_weights <- list(net1, net2, net3, net4)
#meanlog vector for each net
weight_per_net
#meansd vector for each net
sd_per_net
for (i in 1:4) {
  simulated_weights[[i]] <- rlnorm(n = abundance[i], meanlog = weight_per_net[i], sdlog = sd_per_net[i])
  print(simulated_weights)
}
Could anyone please help me automate this so that I don't have to write out each net vector (e.g. net1) by hand, and then also write out all the net names in the list() function? There are far more nets than 4 so it would be extremely time consuming and inefficient to do it this way. I've tried several things from other posts like paste0(), other for loops, as.list(c()), all to no avail.
Thanks!
HM
Turns out you don't need the net1, net2, etc variables at all. You can just do
abundance <- c(5, 10, 9, 20)
simulated_weights <- lapply(abundance, function(x) rep(NA, x))
The lapply function will return the list you need by calling the function once for each value of abundance
We could create the 'simulated_weights' with split and rep
simulated_weights <- split(rep(rep(NA, length(abundance)), abundance),
rep(seq_along(abundance), abundance))
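Going one step further, the empty vectors may not be needed at all, since rlnorm() can build each net's vector directly. A minimal sketch using Map(); the values of weight_per_net and sd_per_net below are invented purely so the example runs, as the question does not give them.

```r
abundance <- c(5, 10, 9, 20)
# Hypothetical parameters, one per net (not from the question)
weight_per_net <- c(0.5, 0.8, 0.3, 1.0)
sd_per_net     <- c(0.20, 0.30, 0.25, 0.10)

set.seed(123)
# Map() walks along the three vectors in parallel and returns a list
simulated_weights <- Map(
  function(n, m, s) rlnorm(n, meanlog = m, sdlog = s),
  abundance, weight_per_net, sd_per_net
)
names(simulated_weights) <- paste0("net", seq_along(abundance))

lengths(simulated_weights)  # one vector per net, sized by abundance
```

This scales to any number of nets because nothing is written out by hand; the list length simply follows length(abundance).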

What causes the difference between calc and cellStats in raster calculations in R?

I am working with a dataset that consists of 20 layers, stacked in a RasterBrick (originating from an array). I have looked into the sum of the layers, calculated with both 'calc' and 'cellStats'. I have used calc to calculate the sum of the total values and cellStats to look at the average of the values per layer (useful for a time series).
However, when I sum the average of each layer, it is half the value of the other calculated sum. What causes this difference? What am I overlooking?
Code looks like this:
testarray <- runif(54214776,0,1)
# Although testarray should contain a raster of 127x147 with 2904 time layers.
# Not really sure how to create that yet.
for (i in 1830:1849){
slice<-array2[,,i]
r <- raster(nrow=(127*5), ncol=(147*5), resolution =5, ext=ext1, vals=slice)
x <- stack(x , r)
}
brickhp2 <- brick(x)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2[r_sumhp2<= 0] <- NA
SWEavgpertimestepM <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
The goal is to compare the sum of the layers calculated with 'calc(x, sum)' with the sum of the mean calculated with 'cellStats(x, mean)'.
Rasterbrick looks like this (600kb, GTiff) : http://www.filedropper.com/brickhp2
*If there is a better way to share this, please let me know.
The confusion arises because calc operates pixel-wise on a brick (i.e. it performs the calculation on the 20 values at each pixel and returns a single raster layer), whereas cellStats performs the calculation on each raster layer individually and returns a single value for each layer. You can see that the results are comparable if you use this code:
library(raster)
##set seed so you get the same runif vals
set.seed(999)
##create example rasters
ls=list()
for (i in 1:20){
r <- raster(nrow=(127*5), ncol=(147*5), vals=runif(127*5*147*5))
ls[[i]] <- r
}
##create raster brick
brickhp2 <- brick(ls)
##calc sum (pixel-wise)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2 ##returns raster layer
##calc mean (layer-wise)
r_meanhp2 <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
r_meanhp2 ##returns vector of length nlayers(brickhp2)
##to get equivalent values you need to divide r_sumhp2 by the number of layers
##and then calculate the mean
cellStats(r_sumhp2/nlayers(brickhp2),stat="mean")
[1] 0.4999381
##and for r_meanhp2 you need to calculate the mean of the means
mean(r_meanhp2)
[1] 0.4999381
You will need to determine for yourself if you want to use the pixel or layer wise result for your application.
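The same pixel-wise versus layer-wise distinction can be seen without the raster package. The sketch below uses a plain 3-D array as a stand-in for the brick (an analogy, not the raster API): apply() over the first two dimensions plays the role of calc, and apply() over the third plays the role of cellStats.

```r
set.seed(999)
# Stand-in for a brick: 4 x 5 cells, 20 layers
arr <- array(runif(4 * 5 * 20), dim = c(4, 5, 20))
n_layers <- dim(arr)[3]

pixel_sum  <- apply(arr, c(1, 2), sum)  # pixel-wise: one value per cell
layer_mean <- apply(arr, 3, mean)       # layer-wise: one value per layer

# Mean of the per-pixel averages equals the mean of the layer means
all.equal(mean(pixel_sum / n_layers), mean(layer_mean))

# Equivalently, the mean per-pixel sum equals the sum of the layer means
all.equal(mean(pixel_sum), sum(layer_mean))
```

These identities hold when there are no missing values. Once cells are set to NA and dropped with na.rm=TRUE, each layer mean may be taken over a different set of cells, so the two routes need not agree.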

Is it possible to include NA for one missing value, and can I verify that each individual corresponds with my specified function

I have code here that generates a random spatial distribution of points, returns a distance column between every point and an infected individual and uses a function to calculate the probability of infection in the next time step. There are 60 hosts, one of which is infected. I would like to bind the values of Pi (which calculates infection probability) to my data frame with the original co-ordinates. Obviously one point is removed from the distance matrix, the infected individual. This value I would like to replace with NA in the main data frame as the next step in my code, and also to confirm that the co-ordinates correspond with the output of the function Pi.
So as it stands I am trying to attach a column of 59 rows to the main data frame of 60 rows.
# Create a spatial distribution with infected individuals
xcoord <- sample(1:100,60)
ycoord <- sample(1:100,60)
infectionstatus <- rep(0,60)
Df <- data.frame(xcoord, ycoord, infectionstatus)
a <- sample(1:60, 1)
Df$infectionstatus[a] <- 1
# Calculate distance between infected individuals and susceptibles
library(rdist)
distances <- pdist(Df[,1:2], metric = "euclidean")
position_infected_individual <- which(Df[,3]==1)
distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]
#Assign parameter values and calculate probability of infection
beta<-100
alpha<-0.1
Pi<-vector()
for (p in 1:length(distance_from_infected)){
Pi[p] = 1-exp(-beta*exp(-alpha*distance_from_infected[p]))
}
The obvious step is:
replace:
distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]
with:
distance_from_infected <- c(NA, distances[-(position_infected_individual), position_infected_individual])
But you're setting yourself up for quite a few failures, because this assumes: (1) that there is only one infected case; (2) that the DF can always be sorted so the infected individual is first; and (3) that NA makes "sense" for this kind of numeric summary.
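A sketch of one way around the sorting assumption: compute the distance from every host to the infected one (so the vector has 60 entries and lines up with the data frame row by row), then blank out the infected row afterwards. It still assumes a single infected individual, and it uses plain arithmetic instead of the rdist package to stay self-contained.

```r
set.seed(1)
Df <- data.frame(xcoord = sample(1:100, 60),
                 ycoord = sample(1:100, 60),
                 infectionstatus = 0)
Df$infectionstatus[sample(1:60, 1)] <- 1

infected <- which(Df$infectionstatus == 1)  # assumes exactly one row

# Distance from every host (including the infected one) to the infected host
d <- sqrt((Df$xcoord - Df$xcoord[infected])^2 +
          (Df$ycoord - Df$ycoord[infected])^2)

beta  <- 100
alpha <- 0.1
# Vectorised version of the loop in the question; row i of Df gets Pi[i]
Df$Pi <- 1 - exp(-beta * exp(-alpha * d))
Df$Pi[infected] <- NA  # the infected host has no infection probability
```

Because Pi is computed in row order, each probability sits next to the coordinates it was calculated from, which also answers the question of whether the coordinates correspond with the output.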

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by assigning to each column the mean value of a window covering n columns around it. I've managed to do it with for loops, but I'd like to see what 'the R way' of doing it would be. Is there a way to get this using apply or some function of the same family?
Example:
# create a toy matrix
mat <- matrix(ncol=200);
for(i in 1:100){ mat <- rbind(mat,sample(1:200, 200) )}
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat that takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d=5){
  # columns with a full window of k.d + 1 neighbours on each side
  k.range <- (k.d + 2):(ncol(k) - k.d - 1)
  k.smooth <- matrix(nrow=nrow(k))
  for( i in k.range){
    if (i %% 10 == 0) cat('\r',round(i/length(k.range), 2))
    # window: k.d + 1 columns to the left, column i, k.d + 1 columns to the right
    k.smooth <- cbind( k.smooth, rowMeans(k[,c( (i-1-k.d):(i-1) ,i, (i+1):(i + 1 + k.d) )]) )
  }
  return(k.smooth)
}
Now we use smooth_row_mat() with mat
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix. My real matrices are around 40,000 x 400; it still works on those, but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
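Two details are worth noting if you try this: apply() over rows returns its results as columns, so the output needs transposing to get back the original orientation, and if dplyr is loaded, filter may resolve to dplyr::filter rather than stats::filter. A minimal sketch on a small made-up matrix (running_mean is a hypothetical helper name):

```r
set.seed(42)
mat <- matrix(sample(1:200, 5 * 20, replace = TRUE), nrow = 5)
k.d <- 5  # window width in columns

# Centred running mean of one row; ends without a full window become NA
running_mean <- function(row) {
  as.numeric(stats::filter(row, rep(1 / k.d, k.d)))
}

# apply() returns one column per input row, hence the t()
mat.smooth <- t(apply(mat, 1, running_mean))
dim(mat.smooth)  # same shape as mat
```

With an odd k.d the window is centred, so the first and last (k.d - 1)/2 columns are NA because they lack a complete window.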
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean over a window of n cells centred on the focal cell (that is, (n-1)/2 cells either side of it). The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, we want each cell to contribute 1/n, so we fill a matrix of one row and n columns with the value 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)
