gstat in R - Variogram cutoff distance is not working at larger specified distances with large gridded datasets

I am attempting to compute variograms in R with the gstat package on biomass data across management areas. The biomass data is a raster dataset with a 3.5 ft (1.0668 m) resolution. The SpatialPointsDataFrame I am passing to the variogram function is 18.6 Mb (814223 elements). (I have also tried a SpatialPixelsDataFrame, but it does not like the 1.0668 m pixel size.) When I run the code:
v = variogram(ras.grid1@data[[1]]~1, data = ras.grid1)
and look at the output "v", I get distance values that are much larger than the management area (and much larger than 1/3 of the diagonal length, which gstat uses as the default cutoff).
When I run the variogram function on smaller management units (40 ha), it gives me the results I would expect (this is using a SpatialPointsDataFrame with a size of 7.9 Mb and 344259 elements).
If I hard-code a smaller cutoff of 200 m with the initial larger raster dataset, it again provides the distance values I expect. If I increase the cutoff to, say, 600 m, it again provides distance values much larger than the specified 600 m cutoff; 300 m also provides unexpected results. For example:
####variogram computation with 200m cutoff....It works
v = variogram(ras.grid1@data[[1]]~1, data = ras.grid1, cutoff=200)
v
np dist gamma dir.hor dir.ver id
1 195954282 8.874169 4990.504 0 0 var1
2 572500880 20.621626 5627.534 0 0 var1
3 958185761 33.701344 5996.423 0 0 var1
4 1288501796 46.920392 6264.396 0 0 var1
5 1652274803 60.198360 6472.187 0 0 var1
6 1947750363 73.502011 6642.960 0 0 var1
7 2282469596 86.807781 6802.124 0 0 var1
8 2551355646 100.131946 6942.277 0 0 var1
9 2849678492 113.441335 7049.838 0 0 var1
10 3093057361 126.751400 7149.102 0 0 var1
11 3375989515 140.081110 7240.848 0 0 var1
12 3585116223 153.418095 7322.990 0 0 var1
13 3821495516 166.721460 7394.616 0 0 var1
14 4036375072 180.053643 7443.040 0 0 var1
15 4235205167 193.389119 7476.061 0 0 var1
####variogram computation with 600m cutoff....It returns unexpected
####distance values
v2 = variogram(ras.grid1@data[[1]]~1, data = ras.grid1, cutoff=600)
v2
np dist gamma dir.hor dir.ver id
1 1726640923 26.54691 5759.951 0 0 var1
2 593559666 510.62232 53413.914 0 0 var1
3 3388536438 229.26702 15737.659 0 0 var1
4 1464228507 966.36789 49726.788 0 0 var1
5 3503141163 623.13559 25680.965 0 0 var1
6 878031648 3454.21122 117680.266 0 0 var1
7 2233138601 1761.91799 50996.719 0 0 var1
8 3266098834 1484.40162 37369.451 0 0 var1
9 4056578316 1420.49358 31556.527 0 0 var1
10 254561085 26030.66780 517601.669 0 0 var1
11 562144107 13256.59985 239163.649 0 0 var1
12 557621435 14631.84504 243476.857 0 0 var1
13 385648032 22771.12890 352898.971 0 0 var1
14 4285655256 2163.11091 31213.201 0 0 var1
15 3744542323 2575.19496 34709.529 0 0 var1
Also, if I scale the data up to a 3 m resolution, I again get the expected distance values.
I am not sure whether the large size of the raster dataset is causing the issue and what I am trying to do is not possible, whether I am doing something wrong, or whether there is another way.
Thank you for the help and interest.

After exploring this in more detail, it does seem to be the size of the SpatialPointsDataFrame causing the issue. On my machine, keeping the size under 10 Mb seemed to do the trick. To reduce the size of the SpatialPointsDataFrame, I sampled the original raster using:
ras.grid<-ras.grid[sample(1:length(ras.grid), 350000),]
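For reference, here is a minimal sketch of that subsample-then-variogram workflow (the file name biomass.tif and the data column name layer are placeholders, and the 350000-point sample size simply follows the workaround above):
library(raster)
library(gstat)
## hypothetical biomass raster at ~1.07 m resolution
biomass.ras <- raster("biomass.tif")
## convert to a SpatialPointsDataFrame, then subsample so the object stays
## small enough for variogram() to behave as expected
pts <- rasterToPoints(biomass.ras, spatial = TRUE)
pts <- pts[sample(1:length(pts), 350000), ]
## empirical variogram with an explicit 600 m cutoff
## (check names(pts) for the actual data column name)
v <- variogram(layer ~ 1, data = pts, cutoff = 600)
plot(v)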


Test to compare proportions / paired (small) samples / 7-level categorical variables

I'm working on data from a pre-post survey: the same participants have been asked the same questions at 2 different times (so the samples are not independent). I have 19 categorical variables (Likert scales with 7 levels).
For each question, I want to know if there is a significant difference between the "pre" and "post" answer. To do this, I want to compare proportions in each of the 7 categories between pre and post results.
I have two datasets (one 'pre' and one 'post') which I have merged as in the following example (I've made sure that the categorical variables have the same levels for PRE and POST):
prepost <- data.frame(ID = c(1:7),
Quest1_PRE = c('5_SomeA','1_StronglyD','3_SomeD','4_Neither','6_Agree','2_Disagree','7_StronglyA'),
Quest1_POST = c('1_StronglyD','7_StronglyA','6_Agree','7_StronglyA','3_SomeD','5_SomeA','7_StronglyA'))
I tried to perform a McNemar test:
temp <- table(prepost_S1$Quest1_PRE,prepost_S1$Quest1_POST)
mcnemar.test(temp)
> McNemar's Chi-squared test
data: temp
McNemar's chi-squared = NaN, df = 21, p-value = NA
But whatever the question, the test always returns NA values. I think it is because the pivot table (temp) has very low frequencies (I only have 24 participants).
One example of a pivot table (I have 22 participants):
> temp
1_StronglyD 2_Disagree 3_SomeD 4_Neither 5_SomeA 6_Agree 7_StronglyA
1_StronglyD 0 0 0 0 0 1 0
2_Disagree 0 0 0 0 1 0 0
3_SomeD 0 0 0 0 0 1 1
4_Neither 0 0 1 1 2 2 2
5_SomeA 0 0 0 0 1 1 2
6_Agree 0 0 0 0 0 3 2
7_StronglyA 0 0 0 0 0 1 2
I've tried aggregating the variables' levels into 5 instead of 7 ("1_Disagree", "2_SomeD", "3_Neither", "4_SomeA", "5_Agree") but it still doesn't work.
Is there an equivalent of Fisher's exact test for paired samples? I've done some research but I couldn't find anything helpful.
If not, could you think of any other test that could answer my question (= Do the answers differ significantly between the pre and post survey)?
Thanks!
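For what it's worth, one option sometimes used for paired ordinal (Likert) responses is the Wilcoxon signed-rank test on numerically coded answers. A minimal sketch using the example data frame above (this assumes the 1_ to 7_ prefixes reflect the intended ordering, and note that it tests a shift in location rather than per-category proportions, so it answers a slightly different question):
## code the ordered levels 1..7 from the "1_StronglyD" .. "7_StronglyA" prefixes
pre <- as.integer(substr(prepost$Quest1_PRE, 1, 1))
post <- as.integer(substr(prepost$Quest1_POST, 1, 1))
## paired Wilcoxon signed-rank test (ties make the p-value approximate)
wilcox.test(pre, post, paired = TRUE)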

Ordination plot with vectors coded as centroids

In the vegan package, I tried to make an ordination plot with species as objects and environmental variables as vectors. However, the environmental variables are treated as centroids instead of vectors. Strangely, each cell of the data frame is seen as a separate environmental factor level, so I think the data frame is not structured correctly. When I plot the ordination without environmental variables, I don't get any problems.
summary(gutter.dca)
environfit = envfit(gutter.dca,gutterenv)
> head(environfit)
$vectors
NULL
$factors
Centroids:
DCA1 DCA2
vocht0,246435845 -0.2185 -1.0601
vocht0,249249249 0.1932 -1.1339
vocht0,251497006 0.0331 -2.0888
vocht0,264735265 -0.3353 -1.3403
vocht0,26911315 -0.0017 -0.9498
vocht0,272369715 -1.0733 0.0021
Species dataframe
head(gutter)
Acer.campestre Acer.pseudoplantanus Adoxa.moschatellina Aegopodium.podagraria Ajuga.reptans
Q1-1 0 0 5 0 0
Q1-2 0 70 15 20 0
Q1-3 0 15 0 0 0
Q1-4 0 3 0 0 0
Q2-1 0 3 0 0 0
Q2-2 1 0 0 0 0
Environmental variables dataframe
head(gutterenv)
vocht Ph.H2O ph.KCl mg.NO3.kg.soil mg.NH4.N.kg.soil litter.depth..cm.
1 0,26911315 7,41 6,686 2,811031105 4,674304351 7,5
2 0,246435845 7,225 6,349 2,567981088 6,735395066 6,5
3 0,264735265 7,001 6,491 2,336821354 8,400116244 5,1
4 0,325123153 6,732 5,444 2,518858082 7,684506342 8,25
5 0,446875 6,87 7,45 2,443686352 9,886923756 4
6 0,548476454 8,1 7,05 3,144954614 11,3179919 3
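One likely explanation, given the comma decimal separators visible in gutterenv and centroid labels such as vocht0,246435845, is that the environmental columns were imported as character/factor rather than numeric, so envfit treats every distinct value as a factor level. A minimal sketch of one way to check and convert (assuming the columns contain nothing but comma-decimal numbers):
str(gutterenv)  # columns read with "," decimals show up as character or factor
## convert comma decimals to points and coerce to numeric,
## so envfit fits the variables as vectors rather than centroids
gutterenv[] <- lapply(gutterenv, function(x) as.numeric(gsub(",", ".", x)))
environfit <- envfit(gutter.dca, gutterenv)
environfit  # continuous variables should now appear under $vectors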

confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.'

I am trying to train a model using the bstTree method in caret and print out the confusion matrix. adverse_effects is my class attribute.
library(caret)  # for createDataPartition, trainControl, train, confusionMatrix
set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]
ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
plot.roc(auc_bstTree)
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) :
The data must contain some levels that overlap the reference.'
max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738
> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1
How can I fix this issue?
> head(trainSplit)
type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5 2 1 14 13 2 0 0 0 0
7 2 0 14 13 2 0 0 0 0
8 2 0 14 13 2 0 0 0 0
9 2 0 14 13 2 1 0 0 0
11 2 1 14 13 2 0 0 0 0
12 2 0 14 13 2 0 0 0 0
uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5 5 1 1 1 22 3.000 0
7 5 0 1 1 22 4.320 0
8 5 0 1 1 22 4.752 0
9 5 0 1 1 22 5.000 0
11 5 1 1 1 22 5.000 0
12 5 0 1 1 22 5.000 0
I had a similar problem with this error. I used the function confusionMatrix:
confusionMatrix(actual, predicted, cutoff = 0.5)
And I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.
I checked a couple of things:
class(actual) -> numeric
class(predicted) -> integer
unique(actual) -> plenty values, since it is probability
unique(predicted) -> 2 levels: 0 and 1
I concluded that there was a problem with applying the cutoff part of the function, so I did it beforehand with:
predicted <- ifelse(predicted > 0.5, 1, 0)
and ran the confusionMatrix function, which now works just fine:
cm<- confusionMatrix(actual, predicted)
cm$table
which generated correct outcome.
One takeaway for your case, which might improve interpretation once you get the code working:
you mixed up the input values for your confusion matrix (as per the confusionMatrix documentation); instead of:
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
you should have written:
conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)
As said, it will most likely help you interpret the confusion matrix once you figure out a way to make it work.
Hope it helps.
max(pred_bstTree) [1] 1.03385
min(pred_bstTree) [1] 1.011738
and the error tells it all. Plotting a ROC curve is simply checking the effect of different threshold points. Rounding happens based on the threshold, e.g. with a threshold of 0.5, 0.7 will be converted to 1 (the TRUE class) and 0.3 will go to 0 (the FALSE class). Threshold values are in the range (0, 1).
In your case, regardless of the threshold, you will always get all observations in the TRUE class, as even the minimum prediction is greater than 1. (That's why @phiver was wondering if you are doing regression instead of classification.) Without any zeros in the predictions there is no level in 'prediction' that coincides with the zero level in adverse_effects, hence this error.
PS: It will be difficult to tell the root cause of the error without you posting your data.
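For completeness, a minimal sketch of the kind of conversion described above, applied to this case (the 1.02 threshold is purely illustrative, chosen only because the predictions here all fall between 1.01 and 1.03):
## threshold the numeric predictions and give both vectors the same factor
## levels before calling caret's confusionMatrix
pred_class <- factor(ifelse(pred_bstTree > 1.02, 1, 0), levels = c(0, 1))
obs_class <- factor(testSplit$adverse_effects, levels = c(0, 1))
confusionMatrix(pred_class, obs_class)
Alternatively, predicting from the train object itself, predict(model_bstTree, testSplit), usually returns class labels directly when the outcome is a factor, which avoids the manual thresholding altogether.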

Combining matrix of daily rows into weekly rows

I have a matrix with dates as row names and TAG#'s as column names. The matrix is populated with 0's and 1's for presence/absence.
eg
29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0
I have the following script for calculating site fidelity (% days present):
##Presence/absence data setup
##import file
read.csv('pn.csv')->pn
##strip out desired columns
pn[,c(5,7:9)]->pn
##create table of dates and tags
table(pn$Date,pn$Tag)->T
##convert to a matrix
as.matrix(T)->U
##convert to binary for presence/absence
1*(U>2)->U
##insert missing rows
library(micEcon)
insertRow(U,395,0)->U
rownames(U)[395]<-'2011-08-16'
insertRow(U,253,0)->U
rownames(U)[253]<-'2011-03-26'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-22'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-21'
##for presence/absence
##define i(tag or column)
1->i
##define place to store results
cbind(colnames(U),rep(NA,length(colnames(U))))->sfresult
##loop instructions
for(i in 1:ncol(U)){
##identify first detection day
grep(1,U[,i])[1]->tagrow
##count total days since first detection
nrow(U)-tagrow+1->days
##count days present
length(grep(1,U[,i]))->present
##calculate site fidelity
present/days->sfresult[i,2]
}
##change class of results column
as.numeric(sfresult[,2])->sfresult[,2]
##histogram
bins<-c(0,.3,.6,1)
xlab<-c('Low','Med','High')
hist(as.numeric(sfresult[,2]), breaks=bins,xaxt='n', col=heat.colors(3), xlab='Percent Days Present',ylab='Frequency (# of individuals)',main='Site Fidelity',freq=TRUE,labels=xlab)
axis(1,at=bins)
I'd like to calculate site fidelity on a weekly basis. I believe it would be easiest to simply collapse the matrix by combining every seven rows into a weekly matrix that simply sums the 0's and 1's from the daily matrix. Then the same script for site fidelity would calculate it on a weekly basis. Problem is I'm a newbie and I've had trouble finding an answer on how to collapse the daily matrix to a weekly matrix. Thanks for any suggestions.
Something like this should work:
x <- matrix(rbinom(1000,1,.2), nrow=50, ncol=20)
rownames(x) <- 1:50
colnames(x) <- paste0("id", 1:20)
require(data.table)
xdt <- as.data.table(x)
##assuming rows are sorted by date, that there are no missing days, and that the first row is the start of the week
###xdt[, week:=sort(rep(1:7, length.out=nrow(xdt)))] ##wrong
xdt[, week:=rep(1:ceiling(nrow(xdt)/7), each=7, length.out=nrow(xdt))] ##fixed
xdt[, lapply(.SD,sum), by="week",.SDcols=setdiff(names(xdt),"week")]
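For comparison, a base-R sketch of the same seven-row collapse using rowsum(), under the same assumptions (rows sorted by date, no missing days, first row starts a week); x is the example matrix from above:
## group index: 1 for the first seven rows, 2 for the next seven, and so on
week <- rep(seq_len(ceiling(nrow(x)/7)), each = 7, length.out = nrow(x))
## sum the 0/1 indicators within each week; row names become the week index
weekly <- rowsum(x, group = week)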
I can help you better preserve rownames if you provide a reproducible example (see "How to make a great R reproducible example?").
Edit:
Also, it's very atypical to use the right assignment -> as you do above.
R's cut function will trim Dates to their week (see ?cut.Date for more details). After that, it's a simple call to aggregate to get the result you need. Note that cut.Date takes a start.on.monday option.
Data
sites <- read.table(text="29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0",
header=TRUE, check.names=FALSE, row.names=1)
Answer
weeks.factor <- cut(as.Date(row.names(sites)),
breaks='weeks', start.on.monday=FALSE)
aggregate(sites, by=list(weeks.factor), FUN=function(col) sum(col)/length(col))
# Group.1 29735 29736 29737 29738 29739 29740
# 1 2010-07-11 1 0.6666667 0 0 0 0
# 2 2010-07-18 1 1.0000000 0 0 0 0

using graph.adjacency() in R

I have a sample code in R as follows:
library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)
where I used a CSV file as input which contains the following data:
23732 23778 23824 23871 58009 58098 58256
23732 0 8 0 1 0 10 0
23778 8 0 1 15 0 1 0
23824 0 1 0 0 0 0 0
23871 1 15 0 0 1 5 0
58009 0 0 0 1 0 7 0
58098 10 1 0 5 7 0 1
58256 0 0 0 0 0 1 0
After this I used the following command to check the weight values:
E(net)$weight
Expected output is somewhat like this:
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
But I'm getting weird values (and different ones every time):
> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316
I'm unable to find where and what I am doing wrong.
Please help me get the correct expected result, and also please tell me why this output is weird and different every time I run it.
Thanks,
Nitin
Just a small working example below, much clearer than CSV input.
library('igraph');
adjm1<-matrix(sample(0:1,100,replace=TRUE,prob=c(0.9,0.1)),nc=10);
g1<-graph.adjacency(adjm1);
plot(g1)
P.s. ?graph.adjacency has a lot of good examples (remember to run library('igraph')).
Related threads
Creating co-occurrence matrix
Co-occurrence matrix using SAC?
The problem seems to be due to the data type of the matrix elements. graph.adjacency expects elements of type numeric. Not sure if it's a bug.
After you do,
m <- as.matrix(dat)
set its mode to numeric by:
mode(m) <- "numeric"
And then do:
net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
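A quick way to confirm this diagnosis up front (a small check, not from the original answer) is to look at the matrix storage mode before building the graph:
m <- as.matrix(dat)
storage.mode(m)  # anything other than "double" (e.g. "integer") can produce the garbled weights above
mode(m) <- "numeric"
storage.mode(m)  # now "double"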
