Accessing R Self Organizing Map Codebook Vectors

I am working to use SOMs to help analyze the variability of weather forecast model ensembles. To do so, I access a 20-member global weather forecast ensemble over a specific geographic domain, convert the 20 x Nlat x Nlon array to a 20 x (Nlat*Nlon) matrix, and present it to the kohonen package's som function. I then seek to access the som "codebook vector" output and transform it back to the latitude/longitude grid. At this step, however, I receive an error.
The error message I receive is:
Error in var.som$codes[i, ] : incorrect number of dimensions
Here var.som is the kohonen object, and I loop over i = 1:Nsom, where Nsom is the number of "maps" specified in the call to the som function.
The attribute data for var.som indicates that the size of var.som$codes is
"num [1:4, 1:500]", suggesting two dimensions, which is why I think my code should work. I have tried different permutations to access the data, e.g. var.som$codes[1] and var.som$codes[[1]], but none solve the problem, and var.som$codes[1,1] yields NULL.
In the R script below, I have reduced the process to only the essential steps. A random number generator replaces the accessing of the weather model data. In the code, I indicate the location of where the error occurs and what the error message is.
Help and guidance on how to access var.som$codes one codebook vector at a time would be appreciated.
# An R script that provides an example of using a Self Organizing Map to calculate a SOM from latitude/longitude
# data. An error occurs when accessing the SOM codebook vectors.
library("kohonen")
# Set a few parameters
Nlon <- 20 # Number of longitude points
Nlat <- 25 # Number of latitude points
Nens <- 20 # number of ensemble members
Nsom <- 4 # number of "maps" in SOM
t2m.en <- as.list(rep(0,Nens))
# Generate Nlon * Nlat random numbers for Nens ensembles
for (i in 1:Nens) {
  t2m.en[[i]] <- runif(Nlon*Nlat, -5, 5)
}
#array containing ensemble data
t2m.ens <- array(unlist(t2m.en), dim=c(Nens, Nlon, Nlat))
t2m.vec <- matrix(t2m.ens, nrow=Nens, ncol=Nlat*Nlon, byrow=TRUE)
# remove the column mean from each column of data (i.e. each grid point)
t2m.scaled <- apply(t2m.vec, 2, scale, scale=FALSE, center=TRUE)
rm(t2m.en)
# LOOP OVER THE VARIABLES TO PLOT
# Conduct the SOM analysis
var.som <- som(t2m.scaled, grid = somgrid(2, 2, "rectangular"))
var.vecc = mat.or.vec(Nlat*Nlon, Nsom)
#populate var.vecc with the SOM output maps
for (i in 1:Nsom) {
  print(i)
  ## THIS IS WHERE THE ERROR IS
  var.vecc[,i] <- var.som$codes[i,]
  ## The Error Message is:
  ## Error in var.som$codes[i, ] : incorrect number of dimensions
}
#var.som$codes[1]
# Plot data from var.vecc on a map

Try: var.vecc[,i] <- var.som$codes[[1]][i,]
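This works because in kohonen version 3.0 and later $codes is a list holding one codebook matrix per data layer, so the list element must be extracted before row indexing. A minimal sketch of the corrected loop under that assumption (a single data layer, as in the script above):
codes.mat <- var.som$codes[[1]]    # Nsom x (Nlat*Nlon) codebook matrix
for (i in 1:Nsom) {
  var.vecc[, i] <- codes.mat[i, ]  # one codebook vector per column
}
# Reshape a codebook vector back onto the grid; the orientation depends on
# how t2m.vec was flattened, so verify it against the original data.
map1 <- matrix(var.vecc[, 1], nrow = Nlon, ncol = Nlat)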

Related

R won't run model because it insists the data is not an unmarkedFrameOccu object

I am trying to create a dual-species occupancy model using unmarkedFrameOccuMulti. I've been successful in producing the UMF and have even got a basic plot of the detections, but when I try to run an individual model I get the error message:
Error in occu(~1, ~Vill_Dist, umf) : Data is not an unmarkedFrameOccu object.
I've made sure the csvs have the same number of rows etc. I'm a bit miffed because I can't find much online, and the UMF itself has run perfectly; R just can't seem to separate out the aspects of it.
S <- 2    # number of species
M <- 354  # number of sites with actual data (i.e. not NAs; some transects were done 14 times, others as few as 2)
J <- 9.07 # average number of visits per transect
y <- list(matrix(rbinom(354, 1, 0.456)),  # species 1: leopard
          matrix(rbinom(354, 1, 0.033))) # species 2: wolf
The above is code I'm following from the R help for unmarkedFrameOccuMulti. The ordering of the numbers is based on the rbinom function, i.e. wolves were seen at 3.3% of the sites surveyed.
obscov <- read.csv("grazcov2.csv")
The error message is: ObsCovData needs M*obsNum of rows
umf <- unmarkedFrameOccuMulti(y=y, siteCovs = predcovs2, obsCovs = NULL)
predcovs2
summary(umf)
plot(umf)
umf
m1 <- occu(~1, ~Vill_Dist, umf) # this is the code that doesn't work; Vill_Dist is one of the covariates in the csv, spelled correctly/consistently etc.
I expected to produce a model that would predict occurrence of leopards/wolves based on the covariates.
As I was writing this out, I had an idea of what might be going wrong. I couldn't get the model to work previously because I was putting in the detection data in csv format rather than using the simple binomial function.
Is it simply that R cannot mix csv/imported data and the binomial data?
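For what it's worth, one likely cause (my assumption; nothing in the thread confirms it): occu() only accepts unmarkedFrameOccu objects, while an unmarkedFrameOccuMulti is fitted with occuMulti(). A minimal sketch, with assumed formulas:
library(unmarked)
# occuMulti takes one detection formula per species and one occupancy
# formula per natural parameter (f1, f2, f12 for two species)
detformulas   <- c("~1", "~1")
stateformulas <- c("~Vill_Dist", "~Vill_Dist", "~1")
m1 <- occuMulti(detformulas, stateformulas, umf)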

Error related to randomisation test within lapply() function in R

I have 30 datasets that are combined in a data list. I want to analyze the spatial point pattern with the L function along with a randomisation test. The code follows.
The first block works well for a single dataset (data1), but once it is applied to a list of datasets with the lapply() function, as shown in the second block, it gives me a very long error like so:
"Error in Kcross(X, i, j, ...) : No points have mark i = Acoraceae
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim =
nsim, : Exceeded maximum number of errors"
Can anybody tell me what is wrong with the second block?
grp <- factor(data1$species)
window <- ripras(data1$utmX, data1$utmY)
pp.grp <- ppp(data1$utmX, data1$utmY, window=window, marks=grp)
L.grp <- alltypes(pp.grp, Lest, correction = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
plot(L.grp)
plot(LE.grp)
L.LE.sp <- lapply(data.list, function(x) {
  grp <- factor(x$species)
  window <- ripras(x$utmX, x$utmY)
  pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
  L.grp <- alltypes(pp.grp, Lest, correction = "Ripley")
  LE.grp <- alltypes(pp.grp, Lcross, envelope = TRUE)
  result <- list(L.grp = L.grp, LE.grp = LE.grp)
  return(result)
})
plot(L.LE.sp[[1]]$LE.grp) # index the list element for the first dataset
This question is about the R package spatstat.
It would help if you could add a minimal working example, including data, which demonstrates this problem.
If that is not available, please generate the error on your computer, then type traceback(), capture the output and post it here. This will trace the location of the error.
Without this information, my best guess is the following:
The error message says No points have mark i=Acoraceae. That means that the code is expecting a point pattern to include points of type Acoraceae but found that there were none. This can happen because in alltypes(... envelope=TRUE) the code generates random point patterns according to complete spatial randomness. In the simulated patterns, the number of points of type Acoraceae (say) will be random according to a Poisson distribution with a mean equal to the number of points of type Acoraceae in the observed data. If the number of Acoraceae in the actual data is small then there is a reasonable chance that the simulated pattern will contain no Acoraceae at all. This is probably what is causing the error message No points have mark i=Acoraceae.
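To make that concrete (an illustrative calculation, not from the original thread): if the observed pattern contains only 3 Acoraceae points, each simulated pattern is free of Acoraceae with probability dpois(0, 3), so a failure somewhere among 100 simulations is almost certain:
p0 <- dpois(0, lambda = 3) # ~0.0498: chance a simulation has zero Acoraceae
1 - (1 - p0)^100           # ~0.994: chance of at least one failing simulation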
If this interpretation is correct then you should be able to suppress the error by including the argument fix.marks=TRUE, that is,
alltypes(pp.grp, Lcross, envelope=TRUE, fix.marks=TRUE, nsim=99)
I'm not suggesting this is necessarily appropriate for your application, but this should remove the error message if my guess is correct.
In the latest development version of spatstat, available on github, the code for envelope has been tweaked to detect this error.

R: autoKrige.cv function in automap package generates NaNs

I’m fairly new to R and I am trying to make interpolations of temperature measurements that were gathered at different stations across the Netherlands. I have data for about 35 stations that make measurements every 10 minutes, covering a timespan of about two weeks. Accordingly, I figured it would be best to make a loop that takes care of this. To see how well the interpolation technique works, I want to do a cross validation for every timestamp.
In order to do this I used the autoKrige function from the automap package, and next I used the compare.cv function from the same package to get an overview of the most important statistics for all time stamps. Besides that, I made sure the cross validation is only done if at least 25 stations registered measurements.
The problem, however, is that my code as described below works most of the time but gives the following warnings in 4 cases:
1. In sqrt(ret[[var.name]]) : NaNs produced
2. In sqrt(ret[[var.name]]) : NaNs produced
3. In sqrt(ret[[var.name]]) : NaNs produced
4. In sqrt(ret[[var.name]]) : NaNs produced
When I try to use the compare.cv command on the total list including all the cross validations, it gives me the following error:
"Error in quantile.default(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, :
missing values and NaN's not allowed if 'na.rm' is FALSE"
I'm wondering what causes the autoKrige function to generate NaNs in the cross validation and, more importantly, how I can remove them from results.cv so that I can use the compare.cv function.
rm(list=ls())
# load packages
require(sp)
require(gstat)
require(ggmap)
require(automap)
require(ggplot2)
#load data (download link provided below)
load("download path") https://www.dropbox.com/s/qmi3loub29e55io/meassurements_aug.RDS?dl=0
# make data spatial and assign spatial coordinate system
coordinates(meassurements) = ~x+y
proj4string(meassurements) <- CRS("+init=epsg:4326")
meassurements_df <- as.data.frame(meassurements)
# loop for cross validation
timestamp <- meassurements$import_log_id
results.cv=list()
for (i in unique(timestamp)) {
  x = meassurements_df[which(meassurements$import_log_id == i), ]
  if (sum(!is.na(x$temperature)) > 25) {
    results.cv[[paste0(i)]] = autoKrige.cv(temperature ~ 1, meassurements[which(meassurements$import_log_id == i & !is.na(meassurements$temperature)), ])
  }
}
# calculate key statistics (RMSE MAE etc)
compare.cv(results.cv)
Thanks!
I came across the same problem and solved it with the help of remove.duplicates() from the sp package, applied to the SpatialPointsDataFrame used for kriging. Prior to that I calculated the mean of the relevant variables in the data frame.
library(dplyr) # for group_by / mutate_at / ungroup
SPDF@data <- SPDF@data %>%
  group_by(varx, vary, varz) %>%
  mutate_at(vars(one_of(relevant_var)), mean, na.rm = TRUE) %>%
  ungroup()
SPDF <- SPDF %>% remove.duplicates()
At the time I was encountering the same problem the Dropbox link above was not working anymore, so I could not check this specific example.
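If the duplicates cannot be removed, another workaround (a sketch of mine, not from the thread; it assumes the autoKrige.cv result stores its cross-validation output in the krige.cv_output element, as current automap versions do) is to drop the entries containing NaNs before calling compare.cv:
# keep only cross-validation results whose z-scores are all finite
ok <- sapply(results.cv, function(cv) all(is.finite(cv$krige.cv_output$zscore)))
compare.cv(results.cv[ok])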

Getting a constant answer while finding patterns with neuralnet package

I'm trying to find patterns in a large dataset using the neuralnet package.
My data file (30,204,447 rows) looks something like this:
id.company,EPS.or.Sales,FQ.or.FY,fiscal,date,value
000001,EPS,FY,2001,20020201,-5.520000
000001,SAL,FQ,2000,20020401,70.300003
000001,SAL,FY,2001,20020325,49.200001
000002,EPS,FQ,2008,20071009,-4.000000
000002,SAL,FY,2008,20071009,1.400000
I have split this initial file into four new files for annual/quarterly sales/EPS, and it is on those files that I want to use neural networks to see if I can use the variables id.company, fiscal and date (in the case below) to predict the annual sales results.
To do so, I have written the following code:
library(neuralnet)
dataset <- read.table("fy_sal_data.txt", header=T, sep="\t") # my file doesn't actually use commas as separators
#extract training set and testing set
trainset <- dataset[1:1000, ]
testset <- dataset[1001:2000, ]
#building the NN
ann <- neuralnet(value ~ id.company + fiscal + date, trainset, hidden = 3,
lifesign="minimal", threshold=0.01)
#testing the output
temp_test <- subset(testset, select=c("id.company", "fiscal", "date"))
ann.results <- compute(ann, temp_test)
#display the results
cleanoutput <- cbind(testset$value, as.data.frame(ann.results$net.result))
colnames(cleanoutput) <- c("Expected Output", "NN Output")
head(cleanoutput, 30)
Now my problem is that the compute function returns a constant answer regardless of the inputs in the testing set.
Expected Output NN Output
1001 2006.500000 1417.796651
1002 2009.000000 1417.796651
1003 2006.500000 1417.796651
1004 2002.500000 1417.796651
I am very new to R and its neural network packages, but I have found online that such results can be caused by either:
an insufficient number of training examples (here I'm using a thousand, but I've also tried a million rows and the results were the same, only it took 4h to train),
or an error in the formula.
I am sure I'm doing something wrong but I can't seem to figure out what.
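One further possibility worth checking (my assumption, not something established in the thread): neuralnet does not rescale its inputs, and raw values such as date (around 20,000,000) will saturate the logistic activation, which typically yields a near-constant output. A minimal sketch of min-max scaling the training data before fitting:
# scale each column to [0, 1] so large raw values (e.g. dates)
# do not saturate the activation function
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
vars <- c("id.company", "fiscal", "date", "value")
trainset.s <- as.data.frame(lapply(trainset[, vars], scale01))
ann <- neuralnet(value ~ id.company + fiscal + date, trainset.s,
                 hidden = 3, lifesign = "minimal", threshold = 0.01)
Predictions from compute() would then need to be mapped back to the original scale of value.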

PCA using raster datasets in R

I have several large rasters that I want to process in a PCA (to produce summary rasters).
I have seen several examples whereby people seem to be simply calling prcomp or princomp. However, when I do this, I get the following error message:
Error in as.vector(data): no method for coercing this S4 class to a vector
Example code:
files<-list.files() # a set of rasters
layers<-stack(files) # using the raster package
pca<-prcomp(layers)
I have tried using a raster brick instead of a stack, but that doesn't seem to be the issue. What method do I need to provide to the command so that it can convert the raster data to vector format? I understand that there are ways to sample the raster and run the PCA from that, but I would really like to understand why the above method is not working.
Thanks!
The above method is not working simply because prcomp does not know how to deal with a raster object. It expects a matrix or data frame, and coercing the S4 raster class to a vector does not work, hence the error.
What you need to do is read each of your files into a vector, and put each of the rasters in a column of a matrix. Each row will then be a time series of values at a single spatial location, and each column will be all the pixels at a certain time step. Note that the exact spatial coordinates are not needed in this approach. This matrix serves as the input of prcomp.
Reading the files can be done using readGDAL, and the spatial data can be cast to a data.frame with as.data.frame.
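A minimal sketch of that approach (assuming the rgdal package and single-band files on a common grid; the file pattern is hypothetical):
library(rgdal)
files <- list.files(pattern = "\\.tif$")
# one row per pixel, one column per raster; band1 is readGDAL's default band name
mat <- sapply(files, function(f) as.data.frame(readGDAL(f))$band1)
pca <- prcomp(mat, scale. = TRUE)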
Answer to my own question: I ended up doing something slightly different. Rather than using every raster cell as input (a very large dataset), I took a sample of points, ran the PCA and then saved the output model so that I could make predictions for each grid cell…maybe not the best solution, but it works:
rasters <- stack(myRasters)
sr <- sampleRandom(rasters, 5000) # sample 5000 random grid cells
# run PCA on random sample with correlation matrix
# retx=FALSE means don't save PCA scores
pca <- prcomp(sr, scale=TRUE, retx=FALSE)
# write PCA model to file
dput(pca, file=paste("./climate/", name, "/", name, "_pca.csv", sep=""))
x <- predict(rasters, pca, index=1:6) # create new rasters based on PCA predictions
There is a rasterPCA function in the RStoolbox package: http://bleutner.github.io/RStoolbox/rstbx-docu/rasterPCA.html
For example:
library('raster')
library('RStoolbox')
rasters <- stack(myRasters)
pca1 <- rasterPCA(rasters)
pca2 <- rasterPCA(rasters, nSamples = 5000) # sample 5000 random grid cells
pca3 <- rasterPCA(rasters, norm = FALSE) # without normalization
Here is a working solution:
library(raster)
filename <- system.file("external/rlogo.grd", package="raster")
r1 <- stack(filename)
pca <- princomp(r1[], cor = TRUE)
res <- predict(pca, r1[])
Display result:
r2 <- raster(filename)
r2[]<-res[,1]
plot(r2)
Yet another option would be to extract the values from the raster stack, i.e.:
rasters <- stack(my_rasters)
values <- getValues(rasters)
pca <- prcomp(values, scale = TRUE)
Here is another approach that expands on the getValues approach proposed by @Daniel. The result is a raster stack. The index (idx) references non-NA positions so that NA values are accounted for.
library(raster)
r <- stack(system.file("external/rlogo.grd", package="raster"))
r.val <- getValues(r)
idx <- which(!is.na(r.val))
pca <- princomp(r.val, cor=T)
ncomp <- 2 # first two principal components
r.pca <- r[[1:ncomp]]
for(i in 1:ncomp) { r.pca[[i]][idx] <- pca$scores[,i] }
plot(r.pca)
