R package spatstat: How to use point process model covariate as factor starting with shape file - r

I have a question similar to this one from 2014, which was answered but the datasets are no longer available and our original data structures differ. (I'm in crunch time and stumped, so if you're able to respond quickly I would greatly appreciate it!!)
Goal: use the type of bedrock as a covariate in a Point Process Model (ppm) in spatstat with mine locations in Connecticut.
Data: the files are available from this Dropbox folder. The rock data and CT poly outline come from UConn Magic Library, and the mine data comes from the USGS Mineral Resources Data System.
Approach:
I loaded some relevant packages and read in the shapefiles (and converted coords to match CT's system), and used the CT polygon as an owin object.
library(rgdal)
library(splancs)
library(spatstat)
library(sp)
library(raster)
library(geostatsp)
#read in shapefiles
ct <- readOGR(".", "CONNECTICUT_STATE_POLY")
mrds <- readOGR(".", "mrds-2017-02-20-23-30-58")
rock <- readOGR(".", "bedrockpolyct_37800_0000_2000_s50_ctgnhs_1_shp_wgs84")
#convert mrds and rock to ct's coord system
tempcrs <- ct@proj4string
mrds<-spTransform(mrds,tempcrs)
rock<-spTransform(rock,tempcrs)
#turn ct shapefile into owin, call it w for window
w <-as.owin(ct)
#subset mrds data to just CT mines
mrdsCT <- subset(mrds, mrds@data$state == "Connecticut")
#ppm can't handle marked data yet, so need to unmark()
#create ppp object for mrds data, set window to w
mrdsCT.ppp <-as.ppp(mrdsCT)
Window(mrdsCT.ppp)<-w
From "Modelling Spatial Point Patterns in R" by Baddeley & Turner (page 39):
Unfortunately a pixel image in spatstat cannot have categorical (factor) values, because R refuses to create a factor-valued matrix. In order to represent a categorical variate as a pixel image, the categorical values should be encoded as integers (for efficiency’s sake) and assigned to an integer-valued pixel image. Then the model formula should invoke the factor command on this image. For example if fim is an image with integer values which represent levels of a factor, then:
ppm(X, ~factor(f), Poisson(), covariates=list(f=fim))
There are several different types of rock classification included in the shapefile. I'm interested in LITHO1, which is a factor with 27 levels. It's the sixth attribute.
litho1<-rock[,6]
My (limited but researched) understanding is that I need to convert the shapefile to a raster, and later convert it to an image in order to be used in ppm. I created a mask from ct, and used that.
ctmask<-raster(ct, resolution=2000)
ctmask[!is.na(ctmask)] <- 0
litho1rast<-rasterize(litho1,ctmask)
After this point, I've tried several approaches and haven't had success just yet. I've attempted to follow the approaches laid out in the question linked, as well as search in documentation for relevant examples to adopt (factor, ratify, levels). Unlike the prior question, my data was already a factor, so it wasn't clear why I should apply the factor function to it.
Looking at litho1rast, the @data@attributes data frame contains the following. If I plot it, it just plots the ID; the levelplot function does plot LITHO1. When I applied the factor functions, the ID was retained but not LITHO1.
$ ID : int [1:1891] 1 2 3 4 5 6 7 8 9 10 ...
$ LITHO1: Factor w/ 27 levels "amphibolite",..: 23 16 23 16 23 16 24 23 16 24 ...
The ppm model needs an object of class im, so I converted the raster to an im. I tried two ways. I can make ppm execute... but it treats every value as its own factor level rather than using the 27 LITHO1 levels (with either litho1.im or litho1.im2):
litho1.im<-as.im(litho1rast)
litho1.im2<-as.im.RasterLayer(litho1rast)
model1=ppm(unmark(mrdsCT.ppp) ~ factor(COV1), covariates=list(COV1=litho1.im))
model1
So, I'm not quite sure where to go from here. It seems like I need to pass an argument to the as.im so that it knows to retain the LITHO1 not the ID. Clever ideas or leads to pertinent functions or approaches much appreciated!

The quoted statement from Baddeley & Turner is no longer true --- that quotation is from a very old set of workshop notes.
Pixel images of class im can have factor values (since 2011). If Z is an integer-valued pixel image (of class im), you can make it into a factor-valued image by setting levels(Z) <- lev where lev is the character vector of labels for the possible values.
You should not need to use rasterize: it should be possible to convert rock[,6] directly into a pixel image using as.im (after loading the maptools package).
See the book by Baddeley, Rubak and Turner (Spatial point patterns: methodology and applications with R, CRC Press, 2016) for a full explanation.
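A minimal sketch of the route the answer describes (untested; assumes maptools is attached so that as.im can handle the sp object, and that the LITHO1 column and the objects w and mrdsCT.ppp from the question are in scope; the exact method dispatch may depend on your spatstat/maptools versions):

```r
library(maptools)   # supplies as.im support for sp objects in spatstat

## convert the LITHO1 column of the bedrock polygons straight to a pixel image
litho1.im <- as.im(rock[, "LITHO1"], W = w)

## if the image comes back integer-valued, attach the factor labels:
## lev is the character vector of level names, as described above
lev <- levels(rock$LITHO1)
levels(litho1.im) <- lev

## a factor-valued image can go into ppm directly; factor() is no longer needed
model1 <- ppm(unmark(mrdsCT.ppp) ~ COV1, covariates = list(COV1 = litho1.im))
```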

Looking at your code, you don't seem to be providing the field argument to rasterize.
From rasterize help:
field: numeric or character. The value(s) to be transferred. This can be a single number, or a vector of numbers that has the same length as the number of spatial features (points, lines, polygons). If x is a Spatial*DataFrame, this can be the column name of the variable to be transferred. If missing, the attribute index is used (i.e. numbers from 1 to the number of features). You can also provide a vector with the same length as the number of spatial features, or a matrix where the number of rows matches the number of spatial features.
at this line:
litho1rast<-rasterize(litho1,ctmask)
you probably have to specify which column of the litho1 object to use in rasterization. Something like:
litho1rast<-rasterize(litho1,ctmask, field = "LITHO1")
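Building on that, one possible route to a factor-valued image for ppm (untested sketch; assumes rasterize stores the factor as its integer codes, and reuses litho1, ctmask and mrdsCT.ppp from the question):

```r
litho1rast <- rasterize(litho1, ctmask, field = "LITHO1")

## convert to an im, then re-attach the LITHO1 labels so that
## ppm sees a categorical covariate rather than integer codes
litho1.im <- as.im.RasterLayer(litho1rast)
levels(litho1.im) <- levels(litho1$LITHO1)

## with a factor-valued image, factor() in the formula is unnecessary
model1 <- ppm(unmark(mrdsCT.ppp) ~ COV1, covariates = list(COV1 = litho1.im))
```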

Related

r Terra issue with multicategorical raster. How to properly extract the categories and their values into layers without losing data?

I am working with terra in R and am having an issue with the CONUS historical disturbance dataset from LANDFIRE, found here: https://landfire.gov/version_download.php (HDist is the name). To summarize what I want to do: I want to take this dataset, crop and project it to my extent, then take the values of the cells and separate them into layers. So I want a layer for severity, one for disturbance type, etc. The historical disturbance data has all of these in one attribute table. In terra, this attribute table is stored as categories, and this is causing a lot of problems. I have not had issues with the crop or the reproject; it is getting into the values and separating the categories into layers. I have the following code:
library(terra)
setwd("your pathway to historical disturbance tif here")
h1 <- terra::rast("LC16_HDst_200.tif") #read in the Hdist tif
h2 <- terra::project(h1, "EPSG:5070", method = "near") #project it using nearest neighbor
h3 <- crop(h2, ext(xmin, xmax, ymin, ymax)) #crop to the extent (fill in your extent values)
h3
This then gives the output in the extent and projection I want but the main focus is the categories
categories : Count, HDIST_ID, DISTCODE_V, DIST_TYPE, TYPE_CONFI, SEVERITY, SEV_CONFID, HDIST_CAT, FDIST, R, G, B
So I learned that with these kinds of datasets, the values are stored under these categories.
if I plot with plot(h3)
I only get the first row of the count category. In order to switch that category I can use
activeCat(h3) <- 4
h3
and I would get
name : DIST_TYPE
min value : Clearcut
max value : Wildland Fire Use
The default active category was Count, but now it's DIST_TYPE, the fourth category; nothing too crazy. I try plotting:
plot(h3)
I only get NoData plotted, none of the others. There is a function called catalyze() that claims to take your categories and convert them all into numeric layers:
h4 <- catalyze(h3)
which gave me a thirteen-layer dataset, which makes sense because there are 13 categories and it converts each into a numeric layer. I tried plotting
plot(h4, 4) #plot h4 layer 4, which would correspond to DIST_TYPE category
it only plots a value of 8, and it looks to show only what are likely NoData values. The map is mostly green, which is in line with the NoData from HDist.
Anytime I try directly accessing values, it crashes. When I look at the min and max values for that layer I get 8 for both: name: DIST_TYPE, min values: 8, max values: 8. Other categories show a similar pattern. So it appears to just take the first row of values for each category and make that the entire layer.
In summary, it is clear that terra stores all of the values that would easily be seen in an attribute table if the dataset were brought into ArcGIS. However, whenever I try to plot it or work with it, even before any real manipulation, it only accesses the top row of that attribute table, and when I catalyze, it just seems to mess everything up even more. I know this is really easy to solve in ArcGIS Pro, but I want to keep everything in R from a documentation-coherency standpoint. Any terra whizzes know what to do about this? I figure it has to be something very easy, but I don't really know what else to try. Maybe it is some major issue, too. I have the same issue with LANDFIRE EVT data. I have not had this issue with simple rasters such as DEM, canopy cover, etc. It is only with these rasters with multiple categories (or columns in an attribute table).
Edit: (screenshot of the crash omitted)
That failed because the (ESRI) VAT IDs are not in the expected (for GDAL categories) 0..255 range. This has now been fixed and I get:
library(terra)
#terra version 1.4.6
r <- rast("LC16_HDst_200.tif")
activeCat(r) <- 4
r <- crop(r, ext(-93345, -57075, 1693125, 1716735))
plot(r)

How can I fit my data and question in a script for the ace function of the ape package in RStudio?

I have 96 amino acid sequences which I aligned with MAFFT and trimmed manually (FASTA format), chose the model of amino acid substitution with ProtTest (LG+I+G model), did the phylogenetic reconstruction with MEGA X (ML method, bootstrap test with 1000 replicates, tree in Newick format) and the ancestral reconstruction with PAML, for a total of 664 final amino acid positions. However, my alignment has indels. I am naming each indel with a letter (A to T) and the respective amino acid position ranges: A: 89-92, B: 66-67, C: 181-186, D: 208-208, E: 214-219, F: 244-250, G: 237-296, H: 278-280, I: 295-295, J: 329-334, K: 345-349, L: 371-375, M: 390-425, N: 432-433, O: 440-443, P: 480-480, Q: 500-500, R: 541-544, S: 600-600. Both the initial and final parts of the sequences are very variable, so from positions 0 to 34 (initial) and 600 to 664 (final), each amino acid position may represent an indel.
I want to know, at each ancestral node, the probability that each indel is present in the ancestral sequence. I was told that the ace function in the R package ape (analysis of phylogenetics and evolution) can perform this task. I have installed both ape and ggtree. I checked this webpage https://www.rdocumentation.org/packages/ape/versions/3.0-1/topics/ace; however, I have no idea how to construct the script. I am a biologist and a newbie to R.
Can someone please help? Would be greatly appreciated, thanks.
It's hard to figure out exactly what you'll need from your example, but the following could fit the general idea:
1 - Load your tree in R
For this step you can use the function read.tree or read.nexus, depending on your tree format: i.e. whether your phylogenetic software outputs a NEXUS file (usually the first line in these files is #NEXUS and the last line is end; or END;) or a newick output (usually the first line directly starts with the phylogeny, like ((my_species..., and finishes with ;). You can locate this file and then read it into R using:
## Loading the package
library(ape)
## Reading the tree
my_tree <- read.tree("<the_path_to_your_file>")
2 - Load your trait data in R
You will then need to load your trait data (for example the indel positions you've listed above) as a matrix or a data.frame. The easiest is to have them in .csv format ("comma separated values"), which you can then read into R using the function read.csv:
## Reading the variables as a matrix
my_variables <- read.csv("<the_path_to_your_file>")
3 - Running an ancestral character estimation
And finally you can run your ancestral character estimation for each of your variables using the ace function from the package ape:
## Selecting the variable of interest (e.g. the first column of the dataset)
one_variable <- my_variables[, 1]
## Running the ancestral character estimation for this variable
my_ace <- ace(x = one_variable, phy = my_tree, type = "discrete")
## Looking at the results
my_ace
Of course there is much more to it, but hopefully this can get you started.

How to use Chao1? (BEGINNER)

I am using R for the first time. I am trying to use the Chao1 function to estimate the diversity of my dataset. I have 20 columns, one for each species, and 8 rows (nine if you include the header), one for each plot. Each cell has a number, which is the number of individuals of that species found in that plot. For example, in my Excel file, cell A2 has the value "8", which means that 8 individuals of Species1 were found in the first plot.
I have downloaded the fossil and vegan packages, where I believe the chao1 function is located. They are active in my library. I have imported my dataset as "speciesabund". I am now trying to run chao1. According to the description (https://artax.karlin.mff.cuni.cz/r-help/library/fossil/html/chao1.html) I'm supposed to type
chao1(x, taxa.row = TRUE)
I assumed "x" was meant to represent my dataset, so I tried
chao1(speciesabund, taxa.row = TRUE)
instead. It did not work, returning "Error: Unsupported use of matrix or array for column indexing." I assume this means that I need to do something more to my data before using the chao1 function. Is that correct? If so, how do I do this?
Thank you so much for your help! I am using this for the first time, so I'm sorry if my question is dumb.
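That particular error message usually comes from indexing a tibble (e.g. data imported with readxl or readr) as if it were a matrix. A sketch of one thing to try, assuming plots are rows and species are columns as described (so taxa.row should be FALSE):

```r
library(fossil)

## chao1 expects a plain matrix, so coerce the imported tibble first;
## species are in columns here, hence taxa.row = FALSE
m <- as.matrix(speciesabund)
chao1(m, taxa.row = FALSE)
```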

Sequence manipulation

I have a matrix that is equivalent to a 96-well plate commonly used in microbiology. In that matrix I randomized 12 treatments, each 8 times. I printed a kind of guide in order to follow the pattern in the lab easily, and then after measurements I merged the randomized plate with the data.
library(reshape2)  # melt() comes from reshape2
cipPlate <- c(rep(c(seq(0, 50, 5), "E"), 8)); cipPlate
rcipPlate <- array(sample(cipPlate), dim = c(8, 12), dimnames = dimna); rcipPlate
platecCIP <- melt(rcipPlate); platecCIP
WellCIP <- paste(platecCIP$Var1, platecCIP$Var2, sep = ''); WellCIP
bbCIP <- data.frame(Well = WellCIP, ID = platecCIP$value); bbCIP
That works fine, except that the numbers in the sequence created for this are characters instead of integers. Then when I try to use ggplot2 to plot this against the measurements, instead of ordering the x-axis as (0, 5, 10, 15, ..., 50) it goes (0, 10, 15, ..., 45, 5, 50).
Is there a way to avoid this, or to make the numbers inside the sequence represent actual integers instead of characters?
BTW: sorry for the clumsy code; I'm not an expert and it works well enough that I can use it further.
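One way around the ordering problem (a sketch based on the bbCIP data frame above) is either to fix the factor level order explicitly or to carry a numeric copy of the concentrations for plotting:

```r
## option 1: set the level order explicitly, keeping the "E" wells last
bbCIP$ID <- factor(bbCIP$ID, levels = c(seq(0, 50, 5), "E"))

## option 2: a numeric column for the x-axis; "E" becomes NA
bbCIP$conc <- suppressWarnings(as.numeric(as.character(bbCIP$ID)))
```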

Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below provides the same data:
track_1 <- data.frame(
  LAT = c(32.34855, 31.54764, 31.38293, 31.21447, 30.76365, 30.75872, 30.60261, 30.62818, 31.35912, 31.15218),
  LON = c(-35.49264, -35.58691, -35.25243, -35.25830, -35.38881, -35.54733, -35.95472, -36.27024, -35.73573, -36.38027),
  SCORE = c(80.67, 18.14, 46.70, 22.65, 11.93, 22.97, 35.98, 31.09, 14.97, 37.60))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2 # one cell for every 0.5 degrees
lonScale <- 2 # likewise
track_1$cell <- factor(with(track_1,
paste(floor(LAT*latScale), floor(LON*lonScale), sep='.')))
library(plyr)
ddply(track_1, .(cell), summarize,
LAT=mean(LAT), LON=mean(LON), SCORE=sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
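If you'd rather avoid plyr, the same per-cell summary can be written in base R (an equivalent sketch, reusing the cell column built above):

```r
## split by cell, then summarise each piece: mean position, summed score
cellSummary <- do.call(rbind, lapply(split(track_1, track_1$cell), function(d)
  data.frame(cell  = d$cell[1],
             LAT   = mean(d$LAT),
             LON   = mean(d$LON),
             SCORE = sum(d$SCORE))))
```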
