I would like to find an efficient way to combine certain character + numeric column values in a list of SpatialPolygonsDataFrame objects. Here is reproducible data:
library(maptools) ## For wrld_simpl
library(sp)
## Example SpatialPolygonsDataFrames (SPDF)
data(wrld_simpl) #polygon of world countries
spdf1 <- wrld_simpl[1:25,] #country subset 1
spdf2 <- wrld_simpl[26:36,] #subset 2
spdf3 <- wrld_simpl[36:50,] #subset 3
#make list of SPDF objects
spdfl<-list()
spdfl[[1]]<-spdf1
spdfl[[2]]<-spdf2
spdfl[[3]]<-spdf3
#view data (attribute table) for one list element
spdfl[[1]]@data
What I would like to do is add another column that is a combination of the FIPS, REGION, and SUBREGION columns, separated by an underscore ('_'). I know how to add+name a new column to each SPDF object in the list as done in the loop below, but I don't know how to get the desired column row entry:
#add new 'unique.id' column to SPDF
for (i in 1:length(spdfl)){
spdfl[[i]]@data["unique.id"] = ""
}
The row entries for the new unique.id column would be in this format: FIPS_REGION_SUBREGION. For example, for the ATG polygon feature in spdfl[[1]], I would like the 'unique.id' column to have an entry like this:
unique.id
AC_19_29
Please advise on how to do this for all features in the SPDF list.
spdfl[[1]]@data$unique.id<-
paste(spdfl[[1]]@data$FIPS,spdfl[[1]]@data$REGION,spdfl[[1]]@data$SUBREGION,sep="_")
Edit: for your desired looping behavior:
for (i in 1:length(spdfl)){
spdfl[[i]]@data$unique.id<-
paste(spdfl[[i]]@data$FIPS,spdfl[[i]]@data$REGION,
spdfl[[i]]@data$SUBREGION,sep="_")
}
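The same idea can be written without an explicit index by mapping a small helper over the list. The sketch below uses plain data frames as stand-ins for the attribute tables so that it runs without sp or maptools; the helper name add_id and all values other than AC_19_29 are made up for illustration. It should carry over to SpatialPolygonsDataFrame objects, since they also support `$` assignment on their attribute columns, but that is not tested here.

```r
# Stand-in attribute tables (illustrative values only)
dfl <- list(
  data.frame(FIPS = c("AC", "AG"), REGION = c(19, 2), SUBREGION = c(29, 15),
             stringsAsFactors = FALSE),
  data.frame(FIPS = "AJ", REGION = 142, SUBREGION = 145,
             stringsAsFactors = FALSE)
)

# Hypothetical helper: paste the three columns together and return the object
add_id <- function(x) {
  x$unique.id <- paste(x$FIPS, x$REGION, x$SUBREGION, sep = "_")
  x
}

# Apply the helper to every element of the list
dfl <- lapply(dfl, add_id)
dfl[[1]]$unique.id
```

Because paste() is vectorized, each call fills the whole column at once, so only the list itself needs iterating.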
Related
I have a list of latitudes and longitudes (each row contains a labeling number and two points, which form a line between them), and I am trying to find the distance from one of the endpoints in each row to the line created by every other row in the list. I am using dist2gc from the geosphere package, together with dplyr. I am doing it with a loop (because I cannot figure out how to get lapply to do it), and I want to create a new column with the distances for each comparison (so if I start with 10 rows, I will make 10 times 10 comparisons and end up with 10 new columns holding the distances to the line in each row). Starting from a 10x8 data.frame, the output would be a 10x18 data.frame.
The temporary column name is not working. Please help!
Can anyone help me come up with the way to make variable naming work, or other approaches to do this?
dist <- function(df){
idx <- seq(1, nrow(df)) # create index values
namelist <- df[,2] # extract name list for new columns
for (i in idx){
templon <- df[i,3] # pull comparison value point for lat and lon
templat <- df[i,4]
tempcoln <- namelist[i]
df <- mutate(df, !! tempcoln := dist2gc(cbind(Longitude1, Latitude1),
cbind(Longitude2, Latitude2), cbind(templon, templat), r=radius,
sign=FALSE))
}
return(df)
}
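One likely cause of the naming problem is that the left-hand side of `:=` in mutate() has to be a string or a symbol, so a numeric label would need to be converted first (for example with as.character() or paste0()). A base R alternative that sidesteps tidy evaluation is to assign with `df[[name]] <- ...`. The sketch below shows only that naming pattern: point_line_dist is a hypothetical planar stand-in for geosphere's dist2gc, and the column names id, x1, y1, x2, y2 are assumptions.

```r
# Hypothetical stand-in for dist2gc: planar distance from point (px, py)
# to the infinite line through (x1, y1) and (x2, y2)
point_line_dist <- function(x1, y1, x2, y2, px, py) {
  abs((x2 - x1) * (y1 - py) - (x1 - px) * (y2 - y1)) /
    sqrt((x2 - x1)^2 + (y2 - y1)^2)
}

add_distance_columns <- function(df) {
  for (i in seq_len(nrow(df))) {
    # Build the new column name as a string from the row label
    new_col <- paste0("dist_", df$id[i])
    # Distance from row i's first endpoint to the line of every row
    df[[new_col]] <- point_line_dist(df$x1, df$y1, df$x2, df$y2,
                                     df$x1[i], df$y1[i])
  }
  df
}

# Two horizontal lines, y = 0 and y = 1 (made-up data)
segs <- data.frame(id = c(101, 102),
                   x1 = c(0, 0), y1 = c(0, 1),
                   x2 = c(1, 1), y2 = c(0, 1))
out <- add_distance_columns(segs)
names(out)
```

Swapping the stand-in for dist2gc would mean passing two-column longitude/latitude matrices instead of separate vectors, as in the question's code.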
I'm working with a spatial polygon dataframe.
data can be downloaded here:
http://geoportal.statistics.gov.uk/datasets/lower-layer-super-output-areas-december-2011-super-generalised-clipped-boundaries-in-england-and-wales
This contains the lower layer output area (lsoa) for England and Wales.
I need to subset the dataframe in order to keep only the polygons for the london lsoa11cd.
I have a list of lsoa11cd for London.
These are between E01000001 and E01004765. I'm not sure how to proceed to subset the spatial polygons (see image attached). Find below an attempt which does not work.
london <- shapefile[substr(shapefile@data$lsoa11cd, -7 , -1) <= 1004765, ]
london <- london[substr(london@data$lsoa11cd, -7 , -1) >= 1000001, ]
If I'm interpreting your question correctly, this should work nicely:
Use the shapefile function from the raster package to read-in the shapefile:
library(raster)
# Read-in the data. This will create a SpatialPolygonsDataFrame with 34,753 features
s <- shapefile('Lower_Layer_Super_Output_Areas_December_2011_Super_Generalised_Clipped__Boundaries_in_England_and_Wales.shp')
It looks like all of the lsoa11cd values have a letter and a number as the first two characters in the string. Let's first subset the data to keep only those with 'E' as the first character of their lsoa11cd value.
s <- s[grep("^E", s$lsoa11cd), ]
Now we can remove the first two characters from each lsoa11cd string and convert to a numeric variable for easier subsetting as follows:
s$lsoa11cd <- as.numeric(substring(s$lsoa11cd, 3))
Then you can simply subset within the range you've specified:
s <- s[s$lsoa11cd %in% 1000001:1004765, ]
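The same filtering logic can be checked on a plain character vector before applying it to the shapefile. This is a minimal sketch with made-up codes; it builds one logical index instead of overwriting the lsoa11cd column, which keeps the original codes intact.

```r
# Made-up codes: two inside the London range, one above it, one Welsh
codes <- c("E01000001", "E01004765", "E01004766", "W01000001")

# Numeric part after the first two characters, e.g. "E01000001" -> 1000001
code_num <- as.numeric(substring(codes, 3))

# Keep English codes whose numeric part falls in the requested range
keep <- grepl("^E", codes) & code_num %in% 1000001:1004765
codes[keep]
```

On the spatial object the equivalent would be s[keep, ], with keep computed from s$lsoa11cd.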
I'm working with sf objects and I have a question about applying user-defined functions to data frames. This is a silly example, but it is similar to what I'm trying to do with a more complicated problem. I have a data frame called names that has two columns, each with the name of a county. I also read in the North Carolina data included with the sf package.
require(sf)
name_1 <- c('Ashe','Alleghany','Surry')
name_2 <- c('Currituck','Northampton','Hertford')
names <- data.frame(cbind(name_1,name_2))
nc <- st_read(system.file("shape/nc.shp", package="sf"))
What I'm trying to do is create a function that goes down each row of my data frame names, takes the name out of the name_1 column and the name out of the name_2 column, and sees if they intersect using the geometric data in nc. I have:
check_intersection <- function(x){
return(st_intersects(nc[nc$NAME== x$name_1,],nc[nc$NAME==x$name_2,],sparse = FALSE))
}
apply(names,1,check_intersection)
But this yields an error of Error in x$name_1 : $ operator is invalid for atomic vectors. How do I tell the function to get the character values out of the appropriate columns, for each row in names?
Try
apply(names, 1,
function(x)
st_intersects(nc[nc$NAME == x[1],], nc[nc$NAME == x[2],],
sparse = FALSE)
)
In the anonymous function, x is a vector, not a data.frame, hence $ does not work.
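A quick way to see this without sf is to inspect what apply() hands to the function. In the sketch below (names_df is just a two-row copy of the example data), each row arrives as a named character vector, so indexing by name with `[[` works as an alternative to the positional x[1] and x[2].

```r
names_df <- data.frame(name_1 = c("Ashe", "Alleghany"),
                       name_2 = c("Currituck", "Northampton"),
                       stringsAsFactors = FALSE)

# Each row is passed as an atomic (character) vector, not a data.frame
row_is_atomic <- apply(names_df, 1, is.atomic)

# Named elements can still be extracted with [[ ]] or [ ]
pairs <- apply(names_df, 1, function(x) {
  paste(x[["name_1"]], x[["name_2"]], sep = " / ")
})
pairs
```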
I'm new to R so please excuse any terminology mistakes... I'm trying to extract the cell numbers for every county in the state of Oklahoma and paste them on top of each other so that I can use them to look at different temperatures throughout Oklahoma state. I have a shapefile of counties in the US, so I made a vector of all the county ID numbers for the state of OK. I then tried to extract the cell numbers and max temp values for every county in a loop. That extract line that I wrote works when I do it one county at a time, I think it's the okcounty=rbind line that's the problem but I don't know what the best way to do this is.
Thank you for your help! I really appreciate it.
okcounties=which(counties$STATE_NAME=="Oklahoma") #contains 58 counties
county = NULL
for (i in 1:58){
  countyvalues=extract(OK.tmax[[1]], extent(counties[okcounties[i],]), cellnumbers=T)
  county=rbind(county, countyvalues) #add data from each of 58 counties
}
I am finding your code a bit confusing and can see a few places it is going wrong. You are overthinking things a bit. I am not sure why you are extracting cellnumbers and not just taking advantage of extract and the stack object.
The "okcounties" object could be an sp-class subset of the counties object that you pass directly to extract, e.g. okcounties <- counties[counties$STATE_NAME=="Oklahoma",].
If you drop the call to extent, which is returning a bounding box for each county and not the county boundary, things get much simpler. To leverage the stack you could just let extract provide a data.frame of the raster values. Here is a worked example on synthetic data. I approximated your object naming convention for this example. The final object "ok.county" I believe would be the same as the "county" object that you are trying to create.
First, let's create some example data and plot
library(raster)
library(sp)
# create polygons
p <- raster(nrow=10, ncol=10)
p[] <- runif(ncell(p)) * 10
counties <- rasterToPolygons(p, fun=function(x){x > 9})
counties$county <- paste0("county",1:nrow(counties))
counties$STATE_NAME <- c(rep("CA",3),
rep("OK",nrow(counties)-3))
# Create raster stack
r <- raster(nrow=100, ncol=100)
r[] <- runif(ncell(r), 40,70)
r <- stack(r, r+5, r+10) # stack
names(r) <- c("June", "July", "Aug")
plot(r[[1]])
plot(counties, add=TRUE, lwd=4) # overlay the polygon outlines
We can use an index to subset to the state we are interested in.
ok <- counties[counties@data$STATE_NAME == "OK",]
Now we can use extract on the entire raster stack. The resulting object will be a list where each polygon has its own element in the list containing a data.frame. Each column of the data.frame represents a layer in the raster stack object.
ok.county <- extract(r, ok)
class(ok.county)
head(ok.county[[1]])
However, if you want to collapse the list into a single data.frame, unique polygon identifiers are missing. Here we are going to use the county column in the SpatialPolygonsDataFrame object. Since the list is ordered the same as the polygon object passed to extract (here the "ok" subset), you can assign unique values from that object. In your case it would likely be the county names and the method would follow the same as the example.
# Take the IDs from the subset that was passed to extract, so the order matches
cnames <- ok@data$county
for(i in seq_along(ok.county)) {
  ok.county[[i]] <- data.frame(county = cnames[i], ok.county[[i]])
}
head(ok.county[[1]])
Now that we have a unique identifier assigned to each data.frame in the list we can collapse it using do.call.
ok.county <- as.data.frame(do.call("rbind", ok.county))
str(ok.county)
Using an apply function we can pull the maximum value for a given column (time-period) for each unique ID.
tapply(ok.county[,"June"], ok.county$county, max)
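The tag, collapse and summarise steps above do not depend on raster, so they can be tried on a hand-made list first. This is a minimal sketch assuming the list returned by extract holds one matrix of layer values per polygon; the values and the names vals, ids, tagged and combined are made up.

```r
# Stand-in for the list returned by extract(): one matrix per polygon,
# one column per raster layer (made-up values)
vals <- list(
  matrix(c(50, 60, 55, 65), ncol = 2, dimnames = list(NULL, c("June", "July"))),
  matrix(c(40, 70, 45, 75), ncol = 2, dimnames = list(NULL, c("June", "July")))
)
ids <- c("county4", "county5")

# Tag each element with its polygon ID, then stack into one data.frame
tagged <- Map(function(id, m) data.frame(county = id, m), ids, vals)
combined <- do.call(rbind, tagged)

# Maximum June value per county
june_max <- tapply(combined$June, combined$county, max)
june_max
```

Map() pairs each ID with its matrix by position, which is the same ordering assumption the loop version relies on.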
As to your original code, something like this would work (obviously, not tested) but there is no unique polygon ID tying results back to the county and it is still the bounding box of the county and not the polygon boundaries.
okcounties <- counties[counties$STATE_NAME=="Oklahoma",]
county = NULL
for (i in 1:nrow(okcounties)){
county <- rbind(county, extract(OK.tmax[[1]],
extent(okcounties[i,]), cellnumbers=T))
}
I have two data sets with latitude, longitude, and temperature data. One data set corresponds to a geographic region of interest with the corresponding lat/long pairs that form the boundary and contents of the region (Matrix Dimension = 4518x2)
The other data set contains lat/long and temperature data for a larger region that envelops the region of interest (Matrix Dimension = 10875x3).
My question is: How do you extract the appropriate row data (lat, long, temperature) from the 2nd data set that matches the first data set's lat/long data?
I've tried a variety of "for loops," "subset," and "unique" commands but I can't obtain the matching temperature data.
Thanks in advance!
10/31 Edit: I forgot to mention that I'm using "R" to process this data.
The lat/long data for the region of interest was provided as a list of 4,518 files containing the lat/long coordinates in the name of each file:
x<- dir()
lenx<- length(x)
g <- strsplit(x, "_")
coord1 <- matrix(NA,nrow=lenx, ncol=1)
coord2 <- matrix(NA,nrow=lenx, ncol=1)
for(i in 1:lenx) {
coord1[i,1] <- unlist(g)[2+3*(i-1)]
coord2[i,1] <- unlist(g)[3+3*(i-1)]
}
coord1<-as.numeric(coord1)
coord2<-as.numeric(coord2)
coord<- cbind(coord1, coord2)
The lat/long and temperature data were obtained from an NCDF file with temperature data for 10,875 lat/long pairs:
long<- tempcd$var[["Temp"]]$size[1]
lat<- tempcd$var[["Temp"]]$size[2]
time<- tempcd$var[["Temp"]]$size[3]
proj<- tempcd$var[["Temp"]]$size[4]
temp<- matrix(NA, nrow=lat*long, ncol = time)
lat_c<- matrix(NA, nrow=lat*long, ncol=1)
long_c<- matrix(NA, nrow=lat*long, ncol =1)
counter<- 1
for(i in 1:lat){
for(j in 1:long){
temp[counter,]<-get.var.ncdf(precipcd, varid= "Prcp", count = c(1,1,time,1), start=c(j,i,1,1))
counter<- counter+1
}
}
temp_gcm <- cbind(lat_c, long_c, temp)
So now the question is how do you remove values from "temp_gcm" that correspond to lat/long data pairs from "coord"?
Noe,
I can think of a number of ways you could do this. The simplest, albeit not the most efficient would be to make use of R's which() function, which takes a logical argument, while iterating over the data frame which you want to apply the matches to. Of course, this is assuming that there can be at most a single match in the larger data set. Based on your data sets, I would do something like this:
# Assumes temp_gcm and coord are data frames with the column names used below
# (lat_c, long_c, temp and coord1, coord2). Columns are referenced explicitly
# rather than via attach(), which same-named workspace objects would mask.
matched.temp <- rep(NA, nrow(coord)) # to store matching results
for (i in seq_len(nrow(coord))) {
  hit <- which(temp_gcm$lat_c == coord$coord1[i] & temp_gcm$long_c == coord$coord2[i])
  if (length(hit) == 1) matched.temp[i] <- temp_gcm$temp[hit]
}
# Now add the results column to the coord data frame (indexes match)
coord$temperature <- matched.temp
The call which(temp_gcm$lat_c == coord$coord1[i] & temp_gcm$long_c == coord$coord2[i]) returns a vector of all rows in the data frame temp_gcm where lat_c and long_c match coord1 and coord2 respectively from row i of the iteration (NOTE: I'm assuming this vector will only have length 1, i.e. there is only 1 possible match; rows of coord without exactly one match are left as NA). matched.temp[i] is then assigned the value from the column temp in temp_gcm that satisfied the logical condition. The goal is to create a vector of matched values that corresponds by index to the rows of the data frame coord.
I hope this helps. Note that this is a rudimentary approach, and I would advise looking up the function merge() as well as apply() to do this in a more succinct manner.
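As a rough illustration of the merge() route, the sketch below joins two small made-up data frames on their coordinate columns. The column names lat, long and temp are assumptions; in the real data both tables would first need matching column names, and the coordinates would have to agree exactly (rounding both sides to a fixed number of decimals may be needed if they come from different sources).

```r
# Region of interest: coordinate pairs only (made-up values)
coord <- data.frame(lat = c(35.0, 35.5), long = c(-97.0, -97.5))

# Larger region: coordinates plus a temperature value (made-up values)
temp_gcm <- data.frame(lat  = c(34.5, 35.0, 35.5, 36.0),
                       long = c(-97.0, -97.0, -97.5, -98.0),
                       temp = c(20.1, 21.3, 19.8, 18.2))

# Inner join: keeps only rows whose (lat, long) pair appears in both tables
matched <- merge(coord, temp_gcm, by = c("lat", "long"))
matched
```

Passing all.x = TRUE would instead keep every row of coord and fill unmatched temperatures with NA.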
I added an additional column full of zeros to use as the resultant for an IF statement. "x" is the number of rows in temp_gcm. "y" is the number of columns (representative of time steps). "temp_s" is the standardized temperature data
indicator<- matrix(0, nrow = x, ncol = 1)
precip_s<- cbind(precip_s, indicator)
temp_s<- cbind(temp_s, indicator)
for(aa in 1:x){
current_lat<-latitudes[aa,1] #Latitudes corresponding to larger area
current_long<- longitudes[aa,1] #Longitudes corresponding to larger area
for(ab in 1:lenx){ #lenx corresponds to nrow(coord)
if(current_lat == coord[ab,1] & current_long == coord[ab,2]) {
precip_s[aa,(y/12+1)]<-1 #y/12+1 corresponds to "indicator column"
temp_s[aa,(y/12+1)]<-1
}
}
}
precip_s<- precip_s[precip_s[,(y/12+1)]>0,] #Removes rows with "0"s remaining in "indicator" column
temp_s<- temp_s[temp_s[,(y/12+1)]>0,]
precip_s<- precip_s[,-(y/12+1)] #Removes "indicator" column
temp_s<- temp_s[,-(y/12+1)]
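The nested loop and the temporary indicator column can probably be replaced by a single logical index built from pasted coordinate keys. The sketch below uses small made-up matrices with the same roles as coord, latitudes, longitudes and temp_s above; key_big, key_roi and temp_roi are hypothetical names, and the approach assumes the coordinates print identically in both sources.

```r
# Made-up stand-ins for the objects used above
coord      <- cbind(c(35.0, 35.5), c(-97.0, -97.5))          # region of interest
latitudes  <- matrix(c(34.5, 35.0, 35.5, 36.0), ncol = 1)    # larger area
longitudes <- matrix(c(-97.0, -97.0, -97.5, -98.0), ncol = 1)
temp_s     <- matrix(1:8, nrow = 4)                          # 4 locations x 2 steps

# One text key per location, e.g. "35 -97"
key_big <- paste(latitudes[, 1], longitudes[, 1])
key_roi <- paste(coord[, 1], coord[, 2])

# TRUE for rows of the larger area that also appear in the region of interest
keep <- key_big %in% key_roi

# drop = FALSE keeps the result a matrix even if only one row matches
temp_roi <- temp_s[keep, , drop = FALSE]
temp_roi
```

The same keep vector could be reused to subset precip_s, since both matrices share the row order of the larger area.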