Extracting data from lower layers in a RasterBrick - R

So I'm extracting data from a RasterBrick I made using the method from this question: How to extract data from a RasterBrick?
In addition to obtaining the data from the layer given by the date, I want to extract the data from months prior. My best guess is to do something like this:
sapply(1:nrow(pts), function(i){extract(b, cbind(pts$x[i],pts$y[i]), layer=pts$layerindex[i-1], nl=1)})
So the extraction should look at layerindex i-1, which should then give the data for one month earlier. A point with layerindex = 5 should look at layer 5-1 = 4.
However, it doesn't do this and seems to give either some random number or a duplicate from months prior. What would be the correct way to go about this?

Your code is taking the value from the layer of the previous point, not from the previous layer.
To see that, imagine we are looking at the point in row 2 (i = 2). Your code indicates the layer with pts$layerindex[i-1], which is pts$layerindex[1]: in other words, the layer of the point in row 1.
The fix is easy enough. For clarity I will write the function separately:
foo <- function(i) extract(b, cbind(pts$x[i], pts$y[i]), layer = pts$layerindex[i] - 1, nl = 1)
sapply(1:nrow(pts), foo)
I have not tested it, but this should be all.
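If you want to check it on a toy example first, something like this should do (made-up data; b and pts here stand in for your real brick and points):
library(raster)
# Toy brick: 6 layers on a 2x3 grid (default extent is 0..1 on both axes);
# layer k holds values in the k*100 range so results are easy to trace
b <- brick(stack(lapply(1:6, function(k) raster(matrix(k * 100 + 1:6, nrow = 2)))))
pts <- data.frame(x = c(0.2, 0.6), y = c(0.4, 0.8), layerindex = c(5, 3))
foo <- function(i) extract(b, cbind(pts$x[i], pts$y[i]), layer = pts$layerindex[i] - 1, nl = 1)
sapply(1:nrow(pts), foo)  # values in the 400s and 200s, i.e. from layers 4 and 2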

Related

Moving spatial data off grid cell corners

I have a seemingly simple question that I can't seem to figure out. I have a large dataset of millions of data points. Each data point represents a single fish with its biological information as well as when and where it was caught. I am running some statistics on these data and have finally tracked down an issue to some data points having latitude and longitude values that fall exactly on the corners of the grid cells I use to bin my data. When these fish are grouped into their appropriate grid cell, they end up duplicated 4 times (once for each cell that touches the corner their coordinates identify).
Needless to say this is bad, and I need to force those animals to have latitudes and longitudes that don't put them exactly on a grid cell corner. I realize there are probably lots of ways to correct something like this, but what I really need is a simple way to identify latitudes and longitudes that have integer values, and then to modify them by a very small amount (randomly adding or subtracting) so as to shift them into a specific cell without creating a bias by shifting them all the same way.
I hope this explanation makes sense. I have included a very simple example in order to provide a workable problem.
fish <- data.frame(fish = 1:10,
                   lat  = c(25, 25, 25, 25.01, 25.2, 25.1, 25.5, 25.7, 25, 25),
                   long = c(140, 140, 140, 140.23, 140.01, 140.44, 140.2, 140.05, 140, 140))
In this fish data frame there are 10 fish, each with an associated latitude and longitude. Fish 1, 2, 3, 9, and 10 have integer lat and long values that will place them exactly on the corners of my grid cells. I need some way of shifting just these values by something like plus or minus 0.01.
I can identify which lats or longs are integers easily enough with something like:
library(dplyr)
near(fish$lat, as.integer(fish$lat))
But I am struggling to find a way to then modify all the integer values by some small amount.
To answer my own question, I was able to work this out this morning with some pretty basic code, see below. All it takes is a function that actually tests whether a value is a whole number, which is.integer() does not (it checks the storage type, not the value).
# is.integer() checks the storage type, not whether the value is whole;
# this tests for whole numbers within floating-point tolerance
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
# Use ifelse to jitter only the whole-number values of lat and long
fish$jitter_lat <- ifelse(is.wholenumber(fish$lat),
                          fish$lat + rnorm(fish$lat, mean = 0, sd = 0.01),
                          fish$lat)
fish$jitter_long <- ifelse(is.wholenumber(fish$long),
                           fish$long + rnorm(fish$long, mean = 0, sd = 0.01),
                           fish$long)
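As a quick sanity check (my addition, not part of the fix above), no jittered coordinate should remain a whole number:
# Should return FALSE: a jittered value landing exactly on a whole
# number again is vanishingly unlikely
any(is.wholenumber(fish$jitter_lat) | is.wholenumber(fish$jitter_long))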

R terra issue with a multi-categorical raster. How to properly extract the categories and their values into layers without losing data?

I am working with the terra package in R and having an issue with the CONUS historical disturbance dataset from LANDFIRE, found here: https://landfire.gov/version_download.php (HDist is the name). To summarize what I want to do: I want to take this dataset, crop and project it to my extent, then take the values of the cells and separate them into layers. So I want a layer for severity, one for disturbance type, etc. The historical disturbance data has all of these in one attribute table. In terra, this attribute table is stored as categories, and this is causing a lot of problems. I have not had issues with the crop or the reprojection; the problem is getting at the values and separating the categories into layers. I have the following code:
library(terra)
setwd("your pathway to historical disturbance tif here")
h1 <- terra::rast("LC16_HDst_200.tif") #read in the Hdist tif
h2 <- terra::project(h1, "EPSG:5070", method = "near") #project it using nearest neighbor
h3 <- crop(h2, ext(xmin, xmax, ymin, ymax)) #crop to the extent (substitute your own coordinates)
h3
This gives output in the extent and projection I want, but the main focus is the categories:
categories : Count, HDIST_ID, DISTCODE_V, DIST_TYPE, TYPE_CONFI, SEVERITY, SEV_CONFID, HDIST_CAT, FDIST, R, G, B
So I learned that with these kinds of datasets, the values are stored under these categories.
So I learned that with these kinds of datasets, the values are stored under these categories. If I plot with plot(h3), I only get the first category, Count. In order to switch the active category I can use
activeCat(h3) <- 4
h3
and I would get
name      : DIST_TYPE
min value : Clearcut
max value : Wildland Fire Use
The default active category was Count, but now it's DIST_TYPE, the fourth category; nothing too crazy. I try plotting
plot(h3)
and I only get NoData plotted, none of the others. There is a function called catalyze() that claims to take your categories and convert them all into numeric layers:
h4 <- catalyze(h3)
which gave me a thirteen-layer dataset, which makes sense because there are 13 categories and it converts each into a numeric layer. I tried plotting
plot(h4, 4) #plot h4 layer 4, which would correspond to DIST_TYPE category
but it only plots a value of 8, and it looks to show only what are likely NoData values. The map is mostly green, which is in line with the NoData from HDist.
Anytime I try directly accessing the values, R crashes. When I look at the min and max values for that layer I get names: DIST_TYPE, min values: 8, max values: 8. Other categories show a similar pattern. So it appears to just take the first row of values for each category and make that the entire layer.
In summary, it is clear that terra stores all of the values that would easily be seen in an attribute table if the dataset were brought into ArcGIS. However, whenever I try to plot it or work with it, even before any real manipulation, it only accesses the top row of that attribute table, and when I catalyze, it seems to mess everything up even more. I know this is easy to solve in ArcGIS Pro, but I want to keep everything in R from a documentation coherency standpoint. Do any terra whizzes know what to do about this? I figure it has to be something very easy, but I don't really know what else to try. I have the same issue with the LANDFIRE EVT data, but not with simple rasters such as DEM, canopy cover, etc. It only happens with rasters that have multiple categories (i.e., columns in an attribute table).
Edit: (screenshot of the crash omitted)
That failed because the (ESRI) VAT IDs are not in the expected (for GDAL categories) 0..255 range. This has now been fixed and I get:
library(terra)
#terra version 1.4.6
r <- rast("LC16_HDst_200.tif")
activeCat(r) <- 4
r <- crop(r, ext(-93345, -57075, 1693125, 1716735))
plot(r)
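With a terra version that includes the fix, catalyze() should also behave as intended, so each attribute can be pulled out as its own numeric layer. A sketch (assuming the category names listed in the question):
library(terra)
r <- rast("LC16_HDst_200.tif")
layers <- catalyze(r)        # one numeric layer per category column
names(layers)                # should include DIST_TYPE, SEVERITY, ...
sev <- layers[["SEVERITY"]]  # a single attribute as its own layer
plot(sev)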

Making a histogram

This sounds pretty basic, but every time I try to make a histogram my code says x needs to be numeric. I've been looking everywhere but can't find anything relating to my problem. I have data with 240 obs of 5 variables:
Nipper length
Number of Whiskers
Crab Carapace
Sex
Estuary location
There are 3 locations, and I'm trying to make a histogram of nipper length by location.
I've tried making new factors and levels with the 80 obs in each location, but it's not working.
Crabs.data <- read.table(pipe("pbpaste"), header = FALSE)  ##Mac
names(Crabs.data) <- c("Crab Identification", "Estuary Location", "Sex",
                       "Crab Carapace", "Length of Nipper", "Number of Whiskers")
Crabs.data <- Crabs.data[, -1]
attach(Crabs.data)
hist(`Length of Nipper` ~ `Estuary Location`)
This produces
Error in hist.default(`Length of Nipper` ~ `Estuary Location`) :
  'x' must be numeric
instead of the correct result.
hist() doesn't take a formula; it expects a single numeric vector.
I think you'd have the best luck subsetting the data, that is, making a vector of nipper lengths for all crabs in a given estuary:
crabs.data <- read.table("whatever you're calling it")
names(crabs.data) <- c(...)  # as you have it
Estuary1 <- as.vector(unlist(subset(crabs.data, `Estuary Location` == "Location",
                                    select = `Length of Nipper`)))
hist(Estuary1)
Repeat the last two lines for your other two estuaries. You may not need the unlist() command, depending on your table. I've tended to need it for Excel files, but I don't know what format your table is in (that would've been helpful).
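If you would rather not copy-paste, a loop over the locations should also work; a sketch, assuming the column names from your code and one panel per estuary:
par(mfrow = c(1, 3))  # three panels, one per estuary
for (loc in unique(Crabs.data$`Estuary Location`)) {
  hist(Crabs.data$`Length of Nipper`[Crabs.data$`Estuary Location` == loc],
       main = paste("Estuary", loc), xlab = "Length of Nipper")
}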

Bourdet Derivative in R with Smoothing Window

I am calculating pressure derivatives using algorithms from this PDF:
Derivative Algorithms
I have been able to implement the "two-points" and "three-consecutive-points" methods relatively easily using dplyr's lag/lead functions to offset the original columns forward and back one row.
The issue with those two methods is that there can be a ton of noise in the high-resolution data we use. This is why there is the third method, "three-smoothed-points", which is significantly more difficult to implement. There is a user-defined "window width", W, typically between 0 and 0.5. The algorithm chooses point_L and point_R as the first points such that ln(t/t_L) > W and ln(t_R/t) > W. Here is what I have so far:
#If necessary install DPLYR
#install.packages("dplyr")
library(dplyr)
#Create initial Data Frame
elapsedTime <- c(0.09583, 0.10833, 0.12083, 0.13333, 0.14583, 0.1680,
                 0.18383, 0.25583)
deltaP <- c(71.95, 80.68, 88.39, 97.12, 104.24, 108.34, 110.67, 122.29)
df <- data.frame(elapsedTime,deltaP)
#Shift the elapsedTime and deltaP columns forward and back one row
df$lagTime <- lag(df$elapsedTime,1)
df$leadTime <- lead(df$elapsedTime,1)
df$lagP <- lag(df$deltaP,1)
df$leadP <- lead(df$deltaP,1)
#Calculate the 2 and 3 point derivatives using nearest neighbors
df$TwoPtDer <- (df$leadP - df$lagP) / log(df$leadTime/df$lagTime)
df$ThreeConsDer <- ((df$deltaP-df$lagP)/(log(df$elapsedTime/df$lagTime)))*
((log(df$leadTime/df$elapsedTime))/(log(df$leadTime/df$lagTime))) +
((df$leadP-df$deltaP)/(log(df$leadTime/df$elapsedTime)))*
((log(df$elapsedTime/df$lagTime))/(log(df$leadTime/df$lagTime)))
#Calculate the window value for the current 1 row shift
df$lnDeltaT_left <- abs(log(df$elapsedTime/df$lagTime))
df$lnDeltaT_right <- abs(log(df$elapsedTime/df$leadTime))
(Screenshot of the resulting data table omitted.) Based on a W of 0.1, only row 2 meets this criterion for both the left and right point. Just FYI, this data set is an extension of the data used in example 2.5 of the referenced PDF.
So, my ultimate question is this:
How can I choose the correct point_L and point_R such that they meet the above criteria? My initial thoughts are some kind of while loop, but being an inexperienced programmer, I am having trouble writing a loop that gets anywhere close to what I am shooting for.
Thank you for any suggestions you may have!
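Here is the kind of loop I have in mind, in case it helps clarify the question (a rough sketch of my reading of the algorithm, not validated against the PDF's examples): for each interior point, walk outward until the log-time separation exceeds W, then reuse the weighted form of the three-consecutive-points derivative with those points.
W <- 0.1
n <- nrow(df)
df$ThreeSmoothDer <- NA
for (i in 2:(n - 1)) {
  # First point to the left at least W away in ln(t)
  L <- i - 1
  while (L > 1 && log(df$elapsedTime[i] / df$elapsedTime[L]) <= W) L <- L - 1
  # First point to the right at least W away in ln(t)
  R <- i + 1
  while (R < n && log(df$elapsedTime[R] / df$elapsedTime[i]) <= W) R <- R + 1
  dlnTL <- log(df$elapsedTime[i] / df$elapsedTime[L])
  dlnTR <- log(df$elapsedTime[R] / df$elapsedTime[i])
  if (dlnTL <= W || dlnTR <= W) next  # window not satisfiable near the edges
  dPL <- df$deltaP[i] - df$deltaP[L]
  dPR <- df$deltaP[R] - df$deltaP[i]
  df$ThreeSmoothDer[i] <- (dPL / dlnTL) * (dlnTR / (dlnTL + dlnTR)) +
                          (dPR / dlnTR) * (dlnTL / (dlnTL + dlnTR))
}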

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things: subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like the sample below; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch, and we're recording many thousands of observations per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time        uri              action
1355683900  /some/uri        TCP_HIT
1355683900  /some/other/uri  TCP_HIT
1355683905  /some/other/uri  TCP_MISS
1355683906  /some/uri        TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset, in ten-second intervals (using the order of the factor levels, TCP_HIT = 1, TCP_MISS = 2, since alphabetical order is used by default):
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN = function(x) sum((2 - as.numeric(x)) / length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri = levels(u$uri)[l],
                        hits = ratio(u[as.numeric(u$uri) == l, ])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"

[[1]]$hits
  u$time%/%10 u$action
1   135568390      0.5

[[2]]
[[2]]$uri
[1] "/some/uri"

[[2]]$hits
  u$time%/%10 u$action
1   135568390      0.5
Or otherwise filter the data frame by URI before computing the ratio.
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
For grouped operations on data this size, data.table is just faster.
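For example, the whole computation might look like this in data.table (a sketch, assuming columns named time, uri, and action as in the sample data):
library(data.table)
dt <- as.data.table(u)
# The 50 most frequent URIs
top50 <- head(dt[, .N, by = uri][order(-N)], 50)$uri
# Hit percentage per URI in ten-second buckets
hits <- dt[uri %in% top50,
           .(hit_pct = mean(action == "TCP_HIT")),
           by = .(uri, bucket = time %/% 10)]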
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot; it's plotting the actual values, I haven't done any conversions.
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")
  title(main = name)
}
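One small optional improvement: inside the loop, converting the epoch seconds to POSIXct before plotting makes the x-axis readable.
# Epoch seconds -> POSIXct so the axis shows dates/times instead of raw numbers
dftemp$time <- as.POSIXct(dftemp$time, origin = "1970-01-01", tz = "UTC")
plot(dftemp$time, dftemp$cache, type = "o")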
