I'm working with sf objects and I have a question about applying user-defined functions to data frames. This is a silly example, but it is similar to what I'm trying to do with a more complicated problem. I have a data frame called names that has two columns, each with the name of a county. I also read in the North Carolina data included with the sf package.
require(sf)
name_1 <- c('Ashe','Alleghany','Surry')
name_2 <- c('Currituck','Northampton','Hertford')
names <- data.frame(cbind(name_1,name_2))
nc <- st_read(system.file("shape/nc.shp", package="sf"))
What I'm trying to do is create a function that goes down each row of my data frame names, takes the name out of the name_1 column and the name out of the name_2 column, and sees if they intersect using the geometric data in nc. I have:
check_intersection <- function(x){
  return(st_intersects(nc[nc$NAME == x$name_1,], nc[nc$NAME == x$name_2,], sparse = FALSE))
}
apply(names,1,check_intersection)
But this yields the error Error in x$name_1 : $ operator is invalid for atomic vectors. How do I tell the function to get the character values out of the appropriate columns for each row in names?
Try
apply(names, 1,
      function(x)
        st_intersects(nc[nc$NAME == x[1],], nc[nc$NAME == x[2],],
                      sparse = FALSE)
)
In the anonymous function, x is a vector, not a data.frame, hence $ does not work.
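If you would rather keep referring to the columns by name, a Map()-based sketch iterates over the two columns in parallel and avoids the positional x[1]/x[2] indexing (as.character() guards against the columns being factors):
res <- Map(function(a, b) {
  # one st_intersects() call per row of names, taking one pair of county names
  st_intersects(nc[nc$NAME == as.character(a), ],
                nc[nc$NAME == as.character(b), ], sparse = FALSE)
}, names$name_1, names$name_2)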
I have a list of plant names in a dataframe. Plant names come as a couplet: "genus" followed by "species". In my case the couplet is already split across columns (which should help). As a dummy example for three species (Helianthus annuus, Pinus radiata, and Melaleuca leucadendra):
df <- data.frame(genus=c("Helianthus", "Pinus", "Melaleuca"), species=c("annuus","radiata", "leucadendra"))
I would like to use a function in the "taxize" package to check these names against a database (IPNI).
There is no batch function for this, and annoyingly the format for querying a single name is:
checked <- ipni_search(genus='Helianthus', species='annuus')
What I need is a loop to feed each genus name and its associated species name into that function.
I can do this for just genus:
list <- df$genus
checked <- lapply(list, function(z) ipni_search(genus=z))
but am tied up in all sorts of knots trying to pass the species with it.
Any help appreciated!
Cheers
Loop (or *apply) over the index, not the actual value:
checked = lapply(
  1:nrow(df),
  function(i) ipni_search(genus = df$genus[i], species = df$species[i])
)
Alternately, you can use Map which is made for iterating over multiple vectors/lists in parallel:
checked = Map(ipni_search, genus = df$genus, species = df$species)
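Either way, checked is a list with one element per row of df. Assuming each ipni_search() call returns a data frame with the same columns (worth verifying on one element before relying on this), a sketch for stacking the results:
# Label each result with its query, then bind the pieces into one data frame
names(checked) <- paste(df$genus, df$species)
combined <- do.call(rbind, checked)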
I have a data frame in R, in which one of the columns contains state abbreviations like 'AL','MD' etc.
Say I wanted to extract the data for state = 'AL'. The expression
dataframe['AL',]
only seems to return one row, whereas there are multiple rows for this state.
Can someone help me understand the error in this approach?
This should work:
mydataframe[mydataframe$state == "AL",]
Or if you want more than one state:
mydataframe[mydataframe$state %in% c("AL","MD"),]
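The reason dataframe['AL',] returns a single row is that a character index in the row position of [ does row-name lookup, not value matching. A minimal sketch with made-up data:
# Hypothetical data frame; rows get the default names "1", "2", "3"
mydataframe <- data.frame(state = c("AL", "AL", "MD"), beds = c(10, 20, 30))
mydataframe["AL", ]                      # row-NAME lookup: no row is named "AL", so one NA row
mydataframe[mydataframe$state == "AL", ] # logical condition: returns both AL rows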
In R, there are always multiple ways to do something. We'll illustrate three different techniques that can be used to subset data in a data frame based on a logical condition.
We'll use data from the 2012 U.S. Hospital Compare Database. We'll check to see whether the data has already been downloaded to disk, and if not, download and unzip it.
if(!file.exists("outcome-of-care-measures.zip")) {
  dlMethod <- "curl"
  if(substr(Sys.getenv("OS"), 1, 7) == "Windows") dlMethod <- "wininet"
  url <- "https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2FProgAssignment3-data.zip"
  download.file(url, destfile = 'outcome-of-care-measures.zip', method = dlMethod, mode = "wb")
  unzip(zipfile = "outcome-of-care-measures.zip")
}
## read outcome data & keep hospital name, state, and some
## mortality rates. Notice that here we use the extract operator
## to subset columns instead of rows
theData <- read.csv("outcome-of-care-measures.csv",
                    colClasses = "character")[, c(2,7,11,17,23)]
This first technique matches the one from the other answer, but we illustrate it with both $ and [[ forms of the extract operator during the subset operation.
# technique 1: extract operator
aSubset <- theData[theData$State == "AL",]
table(aSubset$State)
AL
98
aSubset <- theData[theData[["State"]] == "AL",]
table(aSubset$State)
AL
98
Next, we can subset by using a Base R function, such as subset().
# technique 2: subset() function
aSubset <- subset(theData,State == "AL")
table(aSubset$State)
AL
98
Finally, for the tidyverse fans, we'll use dplyr::filter().
# technique 3: dplyr::filter()
aSubset <- dplyr::filter(theData,State == "AL")
table(aSubset$State)
AL
98
I want to loop through a list of country names contained in a world shapefile and create individual shapefiles for each country. Then I want to perform a calculation on the raster of each shapefile and coerce the results into a dataframe with the country name as an ID variable.
I have written this successfully for an individual country but am struggling to get it to loop correctly.
liech.map <- world.polys[world.polys$NAME == "Liechtenstein",]
plot(liech.map)
rasters <- stack(raster_1, raster_2)
rasters.values <- extract(rasters, liech.map)
df <- as.data.frame(rasters.values)
var <- as.data.frame(weighted.mean(x=df$raster_1, w=df$raster_2, na.rm=TRUE))
What I want to do is extract the list of country names from the world polygon shapefile, create a separate polygon for each country, and loop over every country. Then output one dataframe with the `var` for each country, with a country ID.
EDIT
Here is what I've managed to do so far, what I really want to be able to do is feed the following code a list of ID codes/names to loop over. I could of course copy and paste this manually 200-odd times, but this seems such a poor use of time!!
### Liechtenstein map
## 69.67 sec elapsed
tic()
LTU.map <- world.polys[world.polys$ISO3 == "LTU",]
rasters.values <- extract(rasters, LTU.map)
df <- as.data.frame(rasters.values)
var_LTU <- as.data.frame(weighted.mean(x=df$raster_1, w=df$raster_2, na.rm=TRUE))
var_LTU$iso3 <- "LTU"
rm(LTU.map)
toc()
# define master dataframe once
var_master <- var_LTU
### UK map
## 127.31 sec elapsed
tic()
GBR.map <- world.polys[world.polys$ISO3 == "GBR",]
rasters.values <- extract(rasters, GBR.map)
df <- as.data.frame(rasters.values)
var_GBR <- as.data.frame(weighted.mean(x=df$raster_1, w=df$raster_2, na.rm=TRUE))
var_GBR$iso3 <- "GBR"
var_master <- rbind(var_master, var_GBR)
rm(GBR.map)
toc()
So first things first: we create a list to process. world.polys seems to be a data.frame or similar; we want to turn it into a named list.
polys_by_country <- split(world.polys, world.polys$ISO3)
Next we refactor the code for one country into a function:
extract_raster_value <- function(country_map) {
  # Here imagine country_map is your LTU.map
  rasters.values <- extract(rasters, country_map)
  df <- as.data.frame(rasters.values)
  # compute weighted mean and implicitly return it (last value of the function)
  weighted.mean(x = df$raster_1, w = df$raster_2, na.rm = TRUE)
}
OK, so extract_raster_value takes a country map and returns a single number, a weighted mean. Note that there is no need to "clean up" the workspace using rm: all local variables defined in the function exist only in the function's scope and do not pollute the global environment.
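A quick illustration of that scoping point:
f <- function() {
  tmp <- 42   # tmp exists only inside f()
  tmp * 2
}
f()             # 84
exists("tmp")   # FALSE: nothing leaked into the global environment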
You can check that extract_raster_value works. I have to assume it does, since you have not provided a reproducible example.
LTU.map <- world.polys[world.polys$ISO3 == "LTU",]
extract_raster_value(LTU.map)
The next step is to apply extract_raster_value to each element of polys_by_country.
You can use the apply or lapply functions from base R, but I prefer to use the map family of functions from the purrr package.
library("purrr")
# Apply extract_raster_value to each element of the list and return the list of results
map(polys_by_country, extract_raster_value)
This returns a named list where the names are the ISO3 codes and the values are your weighted means.
Instead of a list, you can get the result as a named numeric vector with:
result <- map_dbl(polys_by_country, extract_raster_value)
This completely avoids loops (or more precisely, hides the loop).
You can easily turn the result into a data.frame if you want:
result_df <- data.frame(
country = names(result),
value = result
)
Of course, there may be a much better way to do it depending on what is actually in world.polys... Typically, if it is a data.frame, this will be much faster to run:
library("dplyr")
world.polys %>%
group_by(ISO3) %>%
summarise(wm = weighted.mean(raster_1, raster_2, na.rm = TRUE))
I am attempting to merge a dataframe containing data on hospital visits ("Hospital_Visits_df") with a shapefile/Spatial*DataFrame containing ZIP/ZCTA polygons and coordinates ("shp", downloaded from the Census Bureau).
Both dataframes contain a matching column, ZCTA and GEOID10, respectively, though shp contains polygons for the entirety of the USA, which I will pare down to the relevant states later.
I have attempted using both merge() and left_join(), but each results in its own error.
shp_hospital_zip1 <- merge(shp, Hospital_Visits_df, by.x = "GEOID10", by.y = "ZCTA")
Error in .local(x, y, ...) : non-unique matches detected
But there are no duplicates in either column.
shp_hospital_zip2 <- shp@data %>%
left_join(Hospital_Visits_df, by = c("GEOID10" = "ZCTA"))
which appeared to work, except that summary(shp_hospital_zip2) no longer identifies it as a spatial data frame.
I am looking to merge my hospital visit data with the spatial data so that I can plot it with leaflet, but I am getting tripped up at the first step.
I appreciate any help you can give me. Thank you very much!
I think you need to use left_join() on your sf-object directly, not on the @data part of your sf-object.
library(sf)
class(shp)
[1] "sf" "data.frame"
class(Hospital_Visits_df)
[1] "data.frame"
result <- shp %>%
left_join(Hospital_Visits_df)
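If left_join() complains about incompatible key types, one likely culprit (an assumption, since we cannot see your data) is that one key is numeric and the other character; ZIP-like codes are safest as character. A sketch:
# Coerce both keys to character so types (and leading zeros) line up
shp$GEOID10 <- as.character(shp$GEOID10)
Hospital_Visits_df$ZCTA <- as.character(Hospital_Visits_df$ZCTA)
result <- shp %>%
  left_join(Hospital_Visits_df, by = c("GEOID10" = "ZCTA"))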
An alternative, if you are working with sp objects read via rgdal: sp::merge() accepts a duplicateGeoms = TRUE argument for when several data frame rows match one geometry.
library(rgdal)
mydf <- read.csv("myCsv.csv")
myspdf <- readOGR("myShapefile.shp")
mynewspdf <- merge(myspdf, mydf, duplicateGeoms = T)
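Note that merge() with no by argument joins on all shared column names; if the keys are named differently, as with GEOID10 and ZCTA in the question above, you would spell them out:
# Explicit keys; duplicateGeoms = TRUE allows several csv rows per geometry
mynewspdf <- merge(myspdf, mydf, by.x = "GEOID10", by.y = "ZCTA",
                   duplicateGeoms = TRUE)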
I have a data frame that I split into different subsets. Based on that, I generate lists of sequences and distance matrices, followed by a hierarchical cluster analysis.
library(TraMineR)
library(WeightedCluster)
library(cluster)
data(mvad)
value1 <- min(grep('\\d{2}$', names(mvad)))
value2 <- max(grep('\\d{2}$', names(mvad)))
mvad.split <- split(mvad, f=mvad$male)
mvad.seq <- lapply(mvad.split, function(x){seqdef(x[value1:value2])})
mvad.om <- lapply(mvad.seq, function(x){seqdist(x, method='OM',
indel=1, sm='TRATE')})
mvad.dis <- lapply(mvad.om, function(x){as.dist(x)})
mvad.hc <- lapply(mvad.dis, function(x){hclust(x,
method='ward.D2')})
Then I want to convert my hierarchical clustering list into a seqtree list with the sequence data and distances:
mvad.tree <- lapply(mvad.hc, function(x){as.seqtree(x,
seqdata=mvad.seq[[x]],
diss=mvad.dis[[x]],
weighted=F,
nclust=6)})
And I get the error Error in mvad.dis[[x]] : invalid subscript type 'list'.
My actual data consists of many cohorts, and the use of split() combined with lapply() would save me a lot of time. Any suggestions?
Within the last lapply, each element comes from a node of mvad.hc, hence each is an hclust object. You don't want to lapply over this list but over the list of names, fetching the corresponding objects. Something like:
mvad.tree <- lapply(as.list(names(mvad.hc)), FUN=function(name){
  as.seqtree(mvad.hc[[name]], seqdata=mvad.seq[[name]],
             diss=mvad.dis[[name]], weighted=F, nclust=6)})
Not tested as we don't have data (cf. comment).
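An equivalent sketch (also untested, same caveat) uses Map() to walk the three parallel lists directly; since they were all built from the same split(), their names line up:
# Map() iterates over the three lists in parallel and keeps their names
mvad.tree <- Map(function(hc, sq, dis) {
  as.seqtree(hc, seqdata = sq, diss = dis, weighted = FALSE, nclust = 6)
}, mvad.hc, mvad.seq, mvad.dis)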