I'm new to R, so please excuse any terminology mistakes. I'm trying to extract the cell numbers for every county in Oklahoma and stack them on top of each other so that I can look at temperatures across the state. I have a shapefile of counties in the US, so I made a vector of all the county ID numbers for Oklahoma. I then tried to extract the cell numbers and max temp values for every county in a loop. The extract line works when I run it one county at a time, so I think the problem is the county=rbind line, but I don't know the best way to do this.
Thank you for your help! I really appreciate it.
`okcounties=which(counties$STATE_NAME=="Oklahoma") #contains 58 counties
county = NULL
for (i in 1:58){
countyvalues=extract(OK.tmax[[1]], extent(counties[okcounties[i],]), cellnumbers=T)
county=rbind(county, countyvalues) #add data from each of 58 counties
}`
I am finding your code a bit confusing and can see a few places it is going wrong. You are overthinking things a bit. I am not sure why you are extracting cellnumbers and not just taking advantage of extract and the stack object.
The "okcounties" object could be an sp-class subset of the counties object that you pass directly to extract, e.g. okcounties <- counties[counties$STATE_NAME=="Oklahoma",] .
If you drop the call to extent, which is returning a bounding box for each county and not the county boundary, things get much simpler. To leverage the stack you could just let extract provide a data.frame of the raster values. Here is a worked example on synthetic data. I approximated your object naming convention for this example. The final object "ok.county" I believe would be the same as the "county" object that you are trying to create.
First, let's create some example data and plot
library(raster)
library(sp)
# create polygons
p <- raster(nrow=10, ncol=10)
p[] <- runif(ncell(p)) * 10
counties <- rasterToPolygons(p, fun=function(x){x > 9})
counties$county <- paste0("county",1:nrow(counties))
counties$STATE_NAME <- c(rep("CA",3),
rep("OK",nrow(counties)-3))
# Create raster stack
r <- raster(nrow=100, ncol=100)
r[] <- runif(ncell(r), 40,70)
r <- stack(r, r+5, r+10) # stack
names(r) <- c("June", "July", "Aug")
plot(r[[1]])
plot(counties, add=TRUE, lwd=4) # overlay the polygons, not the source raster p
We can use an index to subset to the state we are interested in.
ok <- counties[counties@data$STATE_NAME == "OK",]
Now we can use extract on the entire raster stack. The resulting object will be a list where each polygon has its own element in the list containing a data.frame. Each column of the data.frame represents a layer in the raster stack object.
ok.county <- extract(r, ok)
class(ok.county)
head(ok.county[[1]])
However, if you want to collapse the list into a single data.frame, unique polygon identifiers are missing. Here we are going to use the ID column in the SpatialPolygonsDataFrame object. Since the list is ordered the same as the polygon object you can assign unique values from the polygon object. In your case it would likely be the county names and the method would follow the same as the example.
cnames <- ok@data$county # taken from the subset, so the order matches the list returned by extract(r, ok)
for(i in 1:length(ok.county)) {
ok.county[[i]] <- data.frame(county = cnames[i], ok.county[[i]])
}
head(ok.county[[1]])
Now that we have a unique identifier assigned to each data.frame in the list we can collapse it using do.call.
ok.county <- as.data.frame(do.call("rbind", ok.county))
str(ok.county)
Using tapply we can pull the maximum value of a given column (time period) for each unique ID.
tapply(ok.county[,"June"], ok.county$county, max)
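If you want the maximum for every month at once rather than one column at a time, base R's aggregate() can do it in a single call. This is only a sketch on a small made-up data frame (the county names and temperatures are invented for illustration) with the same column layout as the collapsed ok.county object:

```r
# Hypothetical stand-in for the collapsed ok.county data.frame
set.seed(42)
toy <- data.frame(
  county = rep(c("county4", "county5", "county6"), each = 4),
  June   = runif(12, 40, 70)
)
toy$July <- toy$June + 5
toy$Aug  <- toy$June + 10

# One row per county, one column per month, each holding that month's maximum
county.max <- aggregate(cbind(June, July, Aug) ~ county, data = toy, FUN = max)
county.max
```

The result is an ordinary data.frame, so it can be merged back onto the polygon attributes by county name if needed.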
As to your original code, something like this would work (obviously, not tested) but there is no unique polygon ID tying results back to the county and it is still the bounding box of the county and not the polygon boundaries.
okcounties <- counties[counties$STATE_NAME=="Oklahoma",]
county = NULL
for (i in 1:nrow(okcounties)){
county <- rbind(county, extract(OK.tmax[[1]],
extent(okcounties[i,]), cellnumbers=T))
}
I have a sf object as in the example here:
library(sf)
fname <- system.file("shape/nc.shp", package="sf")
nc <- st_read(fname)
plot(nc[1])
Created on 2021-04-15 by the reprex package (v2.0.0)
I want to subset my data so that I get four different sf objects, one for (approximately) each of the four quadrants.
For the data I am working with now, a subset like nc[1:50, ] doesn't make sense, since the rows are randomly ordered; it reduces the number of features but not the extent. I also tried group_by(geom), which didn't work for me.
Can you help me here with this part using nc data as example?
I suggest you assign your objects to quadrants via sf::st_join().
It has a very helpful argument, largest, which ensures that the small polygons are not multiplied: each one is assigned to the quadrant that contains the largest share of its area. So NC keeps all 100 counties and no duplicates are created.
To create the quadrants object, consider applying sf::st_make_grid() to the bounding box of your spatial object, specifying that you want a two-by-two split.
For a full workflow consider the following code:
library(sf)
library(dplyr) # provides %>% and mutate()
fname <- system.file("shape/nc.shp", package="sf")
nc <- st_read(fname)
plot(nc[1])
# create quadrants
quads <- st_bbox(nc) %>%
st_make_grid(n = 2) %>%
st_as_sf(crs = st_crs(nc)) %>%
dplyr::mutate(quad_id = 1:4)
# a visual check
plot(st_geometry(nc))
plot(st_geometry(quads), add = TRUE)
# intersect NC by the quadrants
nc_intersected <- st_join(nc,
quads,
largest = TRUE) # do *not* multiply polygons!
# a visual check
plot(nc_intersected["quad_id"])
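To end up with four separate sf objects, as the question asks, one option is to split the joined object on the quadrant id. The following is a self-contained sketch of that idea; the names nc_q and nc_by_quad are just illustrative, and it assumes base split() dispatches to sf's row subsetting, which is how it is commonly used:

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# 2 x 2 grid over the bounding box, with an id per quadrant
quads <- st_sf(quad_id = 1:4, geometry = st_make_grid(st_bbox(nc), n = 2))

# Assign each county to the quadrant holding most of its area
nc_q <- st_join(nc, quads, largest = TRUE)

# A named list with one sf object per quadrant
nc_by_quad <- split(nc_q, nc_q$quad_id)
sapply(nc_by_quad, nrow)
```

Each element, e.g. nc_by_quad[["1"]], is itself an sf object with its own (smaller) extent, so it can be plotted or written out on its own.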
Hi, I have a few dataframes, each representing samples that received one kind of treatment, which I have combined into a list. I want to run k-means clustering on each dataframe in the list, then use KNN to find the data points closest to each centre and use them as a subsample.
Say I have these 2 dataframes that I bind into a list. Here are sample data: https://drive.google.com/drive/folders/1B8JQY94Z-BHTZEKlV4dvUDocmiyppBDa?usp=sharing
Each dataframe has the same structure: many rows of samples and 107 columns of variables, but the 1st and 2nd columns are just data labels such as the actual drug treatment.
So I perform k-means clustering on each element (dataframe) of the list. From the kmeans output I take the "centers" for each data frame and collect them in another list.
Next, I want to use get.knnx() with the centers generated by kmeans to go back to the original data frame and pick the 500 data points closest to each centre, to get a good subsample of the data.
library(tidyverse) # attaches purrr (map, map2, set_names)
library(FNN) # provides get.knnx()
#take data into list
mylist <- list(df1,df2)
#perform Kmeans cluster
#scale datainput and drop the data label column
Kmeans.list <- map(.x = mylist,
.f = ~kmeans(scale(.x[,-c(1:2)]),
centers =15,
nstart=50,
iter.max = 100)) %>%
purrr::set_names(c("df1", "df2"))
#Isolate the Centers info to another list
Kmeans_centers <- map(Kmeans.list, ~.x$centers)
#trying to use map2
y <- map2(.x = mylist,.y=Kmeans_centers,
.f=~get.knnx(scale(.x[,-c(1:2)]),.y, 500)) %>%
purrr::set_names(c("df1","df2"))
The output of get.knnx has 2 components per element in the output list, one is nn.index and the other is the nn.dist. The nn.index represent the actual row index location of the data and thus I want to use this index to go back the original data (df1,df2) to find those data points, and use those data points as the best representative of the data.
I don't quite know how to only use the nn.index part of the list y, so I think first I can take out the nn.index and make them a list
#bind all the centers into a list for mapping
y.nnindex <- map(y, ~.x$nn.index)%>%
purrr::set_names(c("df1", "df2"))
Each index corresponding to df1 or df2 should be 15 X 500, 15 centres chosen from Kmean, and 500 points selected by KNN.
Here is the part where I'm stuck: I want to order the nn.index values, then use them to subset the original data. I think I can build a list of sampled data and then bind them into a dataframe, but I get Error in mylist[i][idx, ] : incorrect number of dimensions. I think I'm going wrong in how I access the nn.index for each dataframe in the list and how I subset the elements of a list. I always get stuck on this type of problem. Some pointers and explanations would be much appreciated.
cl.list <- list()
for (i in 1:2) {
idx <- sort(y.nnindex[[i]])
cl.list[[i]] <- as.data.frame(cbind(idx, mylist[i][idx,]))
}
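A note on the error itself: mylist[i] returns a list of length one, which has no rows or columns, whereas mylist[[i]] returns the data frame stored in it. Also, sort() on a 15 x 500 index matrix flattens it and keeps any row that is a neighbour of more than one centre as a duplicate. A minimal sketch of one way around both, using toy data and made-up index matrices in place of the real get.knnx() output (all values here are invented for illustration):

```r
# Toy stand-ins: two data frames and made-up neighbour index matrices
# (in the real workflow these would be the nn.index matrices from get.knnx())
set.seed(1)
df1 <- data.frame(id = 1:20, drug = "A", v1 = rnorm(20), v2 = rnorm(20))
df2 <- data.frame(id = 1:30, drug = "B", v1 = rnorm(30), v2 = rnorm(30))
mylist <- list(df1 = df1, df2 = df2)
y.nnindex <- list(
  df1 = matrix(sample(20, 6, replace = TRUE), nrow = 2),  # 2 centres x 3 neighbours
  df2 = matrix(sample(30, 6, replace = TRUE), nrow = 2)
)

# Walk over the data frames and their index matrices in parallel
cl.list <- Map(function(df, nn) {
  idx <- sort(unique(as.vector(nn)))   # flatten the matrix, drop repeated rows
  data.frame(idx = idx, df[idx, , drop = FALSE])
}, mylist, y.nnindex)

# Bind the per-treatment samples into one data frame
sampled <- do.call(rbind, cl.list)
str(sampled)
```

Inside the original for loop the equivalent change would be mylist[[i]][idx, ] instead of mylist[i][idx, ].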
I'm working with a spatial polygon dataframe.
data can be downloaded here:
http://geoportal.statistics.gov.uk/datasets/lower-layer-super-output-areas-december-2011-super-generalised-clipped-boundaries-in-england-and-wales
This contains the lower layer output area (lsoa) for England and Wales.
I need to subset the dataframe in order to keep only the polygons for the london lsoa11cd.
I have a list of lsoa11cd for London.
These are between E01000001 and E01004765. I'm not sure how to proceed to subset the spatial polygons (see image attached). Find below an attempt which does not work.
london <- shapefile[substr(shapefile@data$lsoa11cd, -7 , -1) <= 1004765, ]
london <- london[substr(london@data$lsoa11cd, -7 , -1) >= 1000001, ]
If I'm interpreting your question correctly, this should work nicely:
Use the shapefile function from the raster package to read-in the shapefile:
library(raster)
# Read-in the data. This will create a SpatialPolygonsDataFrame with 34,753 features
s <- shapefile('Lower_Layer_Super_Output_Areas_December_2011_Super_Generalised_Clipped__Boundaries_in_England_and_Wales.shp')
It looks like all of the lsoa11cd values have a letter and a number as the first two characters in the string. Let's first subset the data to keep only those with 'E' as the first character of their lsoa11cd value.
s <- s[grep("^E", s$lsoa11cd), ]
Now we can remove the first two characters from each lsoa11cd string and convert to a numeric variable for easier subsetting as follows:
s$lsoa11cd <- as.numeric(substring(s$lsoa11cd, 3))
Then you can simply subset within the range you've specified:
s <- s[s$lsoa11cd %in% 1000001:1004765, ]
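Note that the steps above overwrite lsoa11cd with numbers, so the original codes are lost. If you would rather keep them, you can build a logical index from a temporary numeric vector instead. A small sketch on a plain data frame standing in for s@data (the five codes are made up for illustration); the same keep vector could be used as s[keep, ] on the SpatialPolygonsDataFrame:

```r
# Hypothetical attribute table standing in for s@data
codes <- data.frame(
  lsoa11cd = c("E01000001", "E01004765", "E01004766", "W01000001", "E01002500"),
  stringsAsFactors = FALSE
)

# Numeric part of the code: everything after the leading letter
num <- as.numeric(substring(codes$lsoa11cd, 2))

# English codes (prefix "E") whose number falls in the London range
keep <- startsWith(codes$lsoa11cd, "E") & num >= 1000001 & num <= 1004765

codes[keep, , drop = FALSE]
```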
I need to cluster some data and I tried kmeans, pam, and clara with R.
The problem is that my data are in a column of a data frame, and contains NAs.
I used na.omit() to get my clusters. But then how can I associate them with the original data? The functions return a vector of integers without the NAs and they don't retain any information about the original position.
Is there a clever way to associate the clusters to the original observations in the data frame? (or a way to intelligently perform clustering when NAs are present?)
Thanks
The output of kmeans corresponds to the elements of the object passed as argument x. In your case, you omit the NA elements, and so $cluster indicates the cluster that each element of na.omit(x) belongs to.
Here's a simple example:
d <- data.frame(x=runif(100), cluster=NA)
d$x[sample(100, 10)] <- NA
clus <- kmeans(na.omit(d$x), 5)
d$cluster[which(!is.na(d$x))] <- clus$cluster
And in the plot below, colour indicates the cluster that each point belongs to.
plot(d$x, bg=d$cluster, pch=21)
This code works for me, starting with a matrix containing a whole row of NAs:
DF=matrix(rnorm(100), ncol=10)
row.names(DF) <- paste("r", 1:10, sep="")
DF[3,]<-NA
res <- kmeans(na.omit(DF), 3)$cluster
res
DF=cbind(DF, 'clus'=NA)
DF[names(res),][,11] <- res
print(DF[,11])
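For a data frame with several columns, the same idea can be written with complete.cases(), which flags the rows that have no NA in any column. A minimal sketch on simulated data (the column names and the choice of 3 clusters are arbitrary):

```r
set.seed(123)
d <- data.frame(x = rnorm(50), y = rnorm(50))
d$x[c(3, 17)] <- NA
d$y[c(8, 17, 40)] <- NA

ok <- complete.cases(d)                 # TRUE for rows with no NA
d$cluster <- NA_integer_                # rows with NAs keep cluster = NA
d$cluster[ok] <- kmeans(d[ok, c("x", "y")], centers = 3)$cluster
table(d$cluster, useNA = "ifany")
```

Because the logical index ok has one entry per original row, the cluster labels land back in their original positions.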
I saw this code here: http://learnr.wordpress.com/2009/08/10/ggplot2-version-of-figures-in-lattice-multivariate-data-visualization-with-r-part-9/
hc1 <- hclust(dist(USArrests, method = "canberra"))
hc1 <- as.dendrogram(hc1)
ord.hc1 <- order.dendrogram(hc1)
hc2 <- reorder(hc1, state.region[ord.hc1])
ord.hc2 <- order.dendrogram(hc2)
region.colors <- trellis.par.get("superpose.polygon")$col
USArrests2 <- melt(t(scale(USArrests)))
USArrests2$X2 <- factor(USArrests2$X2, levels = state.name[ord.hc2])
But I'm very confused by the state.region variable in the fourth line.
The ordering ord.hc1 was generated from USArrests, which seems to have nothing to do with state.region. So why does it use state.region for reordering instead of a column of the USArrests data frame?
Look at the help file for state.region -
?state.region
The first sentence under Details, is
R currently contains the following "state" data sets.
Note that all data are arranged according to alphabetical
order of the state names.
This means that we can jump between the USA data sets, since they are all in the same order, i.e. the first row of USArrests refers to the same state as the first element of state.region, and so on.
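A quick way to check this for yourself, using only the datasets that ship with R (the name arrests is just for illustration):

```r
# Row names of USArrests and the state.* vectors share the same ordering
same_order <- identical(rownames(USArrests), state.name)
same_order

# So state.region can be attached to USArrests by position
arrests <- data.frame(USArrests, region = state.region)
head(arrests[, c("Murder", "region")])
```

This is also why state.region[ord.hc1] makes sense in the code above: ord.hc1 is a vector of row positions in USArrests, and those positions index state.region in the same way.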