I'd like to join a set of ways which are related and together give a district's boundary.
I tried the following but got stuck:
require(osmar)
require(XML)
# a set of open street map ways (lines) related as given by a relation..
# (if connected these ways represent the boundary of a political
# district in Tyrol/Austria)
myxml <- xmlParse("http://api.openstreetmap.org/api/0.6/relation/85647")
# extracting way ids at the according xml-nodes:
els <- getNodeSet(myxml, "//member[@ref]")
ways <- as.numeric(sapply(els, function(el) xmlGetAttr(el, "ref")))
# now I try to get one of those ways as an osmar-obj and plot it,
# which throws an error:
plot_ways(get_osm(way(ways[1])))
Apparently there's a bounding box missing, but I don't know how to assign one to this sort of object. Once this problem is resolved, I'd like to make one polygon out of the lines/ways.
The author of the package was kind enough to provide information that is lacking in the current documentation:
the argument all = T was simply missing from the get_osm() call; with all = T, all related elements are retrieved.
To get the desired district boundary, the following code applies:
District_Boundary <- get_osm(relation(85647), all = T)
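From here, the member ways can be converted to spatial lines as a step toward one polygon. A minimal sketch (assuming osmar's as_sp() converter applies; untested on this exact relation):
plot_ways(District_Boundary)  # works now, since all elements needed for the bounding box are present
bdy_lines <- as_sp(District_Boundary, "lines")  # SpatialLinesDataFrame of the member ways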
I have a directory with a bunch of shapefiles for 50 cities (and will accumulate more). They are divided into three groups: cities' political boundaries (CityA_CD.shp, CityB_CD.shp, etc.), neighborhoods (CityA_Neighborhoods.shp, CityB_Neighborhoods.shp, etc.), and Census blocks (CityA_blocks.shp, CityB_blocks.shp, etc.). They use a common file-naming syntax, have the same set of attribute variables, and are all in the same CRS. (I transformed all of them as such using QGIS.) I need to read each group of files (political boundaries, neighborhoods, blocks) as a list of sf objects and then bind the rows to create one large sf object for each group. However, I am running into consistent problems developing this workflow in R.
library(tidyverse)
library(sf)
library(mapedit)
# This first line succeeds in creating a character string of the files that match the regex pattern.
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# This second line creates a list object from the files.
shapefile_list <- lapply(filenames, st_read)
# This third line (adopted from https://github.com/r-spatial/sf/issues/798) fails as follows.
districts <- mapedit:::combine_list_of_sf(shapefile_list)
Error: Column `District_I` can't be converted from character to numeric
# This fourth line fails in an apparently different way (also adopted from https://github.com/r-spatial/sf/issues/798).
districts <- do.call(what = sf:::rbind.sf, args = shapefile_list)
Error in CPL_get_z_range(obj, 2) : z error - expecting three columns;
The first error appears to indicate that one of my shapefiles has an incorrect variable class for the common variable District_I, but R provides no information to clue me in to which file is causing the error.
The second error seems to be looking for a z coordinate but is only finding x and y in the geometry attribute.
I have four questions on this front:
How can I have R identify which list item it is attempting to read and bind when an error halts the process?
How can I force R to ignore the incompatibility issue and coerce the variable class to character so that I can deal with the variable inconsistency (if that's what it is) in R?
How can I drop a variable that is causing an error entirely from the read sf objects (i.e. omit District_I for all read_sf calls in the process)?
More generally, what is going on and how can I solve the second error?
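For the second error, I wonder whether any Z/M values could simply be dropped before binding; a sketch of what I mean, assuming sf::st_zm() applies here:
# drop Z/M dimensions from every layer before attempting the bind
shapefile_list <- lapply(shapefile_list, function(x) sf::st_zm(x, drop = TRUE))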
Thanks all as always for your help.
P.S.: I know this post isn't "reproducible" in the desired way, but I'm not sure how to make it so besides copying the contents of all my shapefiles. If I'm mistaken on this point, I'd gladly accept any wisdom on this front.
UPDATE:
I've run
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
shapefile_list <- lapply(filenames, st_read)
districts <- mapedit:::combine_list_of_sf(shapefile_list)
successfully on a subset of three of the shapefiles. So I've confirmed that a class conflict in the District_I column of one of the files is causing the hold-up when running the code on the full batch. But again, I need the error to identify the file causing the issue so I can fix it in the file, OR I need the code to coerce District_I to character in all files (which is the class I want that variable to be in anyway).
A note, particularly regarding Pablo's recommendation:
districts <- do.call(what = dplyr::rbind_all, shapefile_list)
results in an error
Error in (function (x, id = NULL) : unused argument
followed by a long string of digits and coordinates. So,
mapedit:::combine_list_of_sf(shapefile_list)
is definitely the mechanism to read from the list and merge the files, but I still need a way to diagnose the source of the column incompatibility error across shapefiles.
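In the meantime, a quick way to spot the offending file is to compare the class of District_I across the list (a sketch, using filenames and shapefile_list from above):
# report each file's class for District_I; the odd one out is the culprit
classes <- sapply(shapefile_list, function(x) class(x$District_I)[1])
setNames(classes, basename(filenames))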
So after much fretting and some great guidance from Pablo (and his link to https://community.rstudio.com/t/simplest-way-to-modify-the-same-column-in-multiple-dataframes-in-a-list/13076), the following works:
library(tidyverse)
library(sf)
# Reads in all shapefiles from Directory that include the string "_CDs".
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
# Applies the function st_read from the sf package to each file saved as a character string to transform the file list to a list object.
shapefile_list <- lapply(filenames, st_read)
# Creates a function that transforms a problem variable to class character for all shapefile reads.
my_func <- function(data, my_col){
    my_col <- enexpr(my_col)  # enexpr() is from rlang; load it with library(rlang) if needed
    data %>%
        mutate(!!my_col := as.character(!!my_col))
}
# Applies the new function to our list of shapefiles and specifies "District_I" as our problem variable.
districts <- map_dfr(shapefile_list, ~my_func(.x, District_I))
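For a single known column like this, a simpler variant without tidy evaluation should work as well (a sketch):
# coerce District_I to character in each element, then row-bind
districts <- map_dfr(shapefile_list,
                     ~ mutate(.x, District_I = as.character(District_I)))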
Problem:
I'm trying to import a Newick-format phylogenetic tree. I've done this before (with a tree made in the same way, so the code works!), but this time the tree appears to be the problem: I'm getting a duplicate tip labels error. If that is the case, is there a way to easily remove duplicate tips in R?
Current code:
library(ape)
library(geiger)
library(caper)
taxatree <- read.tree("test2.tre")
sumdata <- read.csv("ogtprop.csv")
sumdataPGLS <- data.frame(A = sumdata$A, OGT = sumdata$OGT, Species = sumdata$Species)
sumdataPGLS$Species <- gsub(" ", "_", sumdata$Species)
# this line replaces the space between genus and species with an underscore in my dataframe (as the tree is formatted like this)
comp.dat <- comparative.data(taxatree, sumdataPGLS, "Species")
I get the following error after the last line:
Error in comparative.data(taxatree, sumdataPGLS, "Species") :
Duplicate tip labels present in phylogeny
Suggesting the problem is purely with the phylogeny, not the dataframe.
Desired outcome:
A way to remove duplicate tip labels in R
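Something along these lines is what I have in mind; a sketch using ape's drop.tip(), assuming the first occurrence of each duplicated label can be kept:
dups <- duplicated(taxatree$tip.label)       # TRUE for repeat occurrences of a label
taxatree <- drop.tip(taxatree, which(dups))  # drop them, keeping the first of each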
Input data:
Unfortunately the tree is so large that I can't put it all in here, but here is a subset of the data (note: this will not work by itself). I am presenting it here in case there are any systematic errors that are obvious to others:
(((('Acidilobus_saccharovorans':4,'Caldisphaera_lagunensis':4)Acidilobales:4,
('Sulfurisphaera_tokodaii':4,('Metallosphaera_hakonensis':4,
'Metallosphaera_sedula':4)Metallosphaera:4,('Acidianus_sulfidivorans':4,
'Acidianus_brierleyi':4)Acidianus:4,('Sulfolobus_metallicus':4,
'Sulfolobus_solfataricus':4,'Sulfolobus_acidocaldarius':4)Sulfolobus:4)
Sulfolobaceae:4,(('Pyrolobus_fumarii':4,'Hyperthermus_butylicus':4,
'Pyrodictium_occultum':4)Pyrodictiaceae:4,('Aeropyrum_camini':4,
('Ignicoccus_hospitalis':4,'Ignicoccus_islandicus':4)Ignicoccus:4,
I ran into the same problem with my comparative data. I had:
maxillariinae <- comparative.data(tree_gs, data.000, spp_code, vcv=TRUE, vcv.dim=3)
Error in comparative.data(tree_gs, data.000, spp_code, vcv = TRUE, vcv.dim = 3) :
Labels duplicated between tips and nodes in phylogeny
I've solved it in a very simple way:
# Removing node labels:
tree_gs$node.label <- NULL
And then when I tried to set up the comparative data, it just worked. The PGLS I ran next worked as well.
I hope it works for you.
One possible solution. The issue appears to be the format of the tree being read into the 'phylo' class: in this case the internal nodes have names, and some of these names are the same as genus names.
A way to 'clean' the tree is to reformat it; an approach I found to work uses the Python package ete3 (http://etetoolkit.org/):
from ete3 import Tree
import sys
t = Tree(sys.argv[1], format=1)
t.write(format=5, outfile="test4.tre")
The useful call is t.write(format=5): format=5 means it writes the tree in a form acceptable to the comparative.data function used in R, in this case without internal node names.
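The same cleanup can presumably be done without leaving R, since ape stores internal node names in node.label (a sketch, untested on this particular tree):
library(ape)
taxatree <- read.tree("test2.tre")
taxatree$node.label <- NULL               # drop the internal node names
write.tree(taxatree, file = "test4.tre")  # written without node labels, like ete3's format=5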
I ran into the same problem because my Newick tree included bootstrap support values in addition to distances. comparative.data worked fine after removing the support values. (The bootstrap values were in the range 0.97 to 0.99.)
Here are the original and revised trees:
Original
((Alligator:0.09129139,(Turtle:0.12361699,(Lizard:0.18330984,
((TasmDevil:0.02519765,Opossum:0.01841396)0.998733:0.03121792,
(Armadillo:0.05330751,((Cow:0.12244558,Dog:0.07483858)0.983085:0.02485452,
(Mouse:0.14438626,GuineaPig:0.03974587)0.972224:0.02107559)0.889194:0.01974521)
0.99985:0.03529365)0.99985:0.18024398)0.988266:0.074151)0.974215:0.11888747)
:1.0964437,Frog:1.0964437):0.0;
Revised
((Alligator:0.09129139,(Turtle:0.12361699,(Lizard:0.18330984,
((TasmDevil:0.02519765,Opossum:0.01841396):0.03121792,
(Armadillo:0.05330751,((Cow:0.12244558,Dog:0.07483858):0.02485452,
(Mouse:0.14438626,GuineaPig:0.03974587):0.02107559):0.01974521):0.03529365)
:0.18024398):0.074151):0.11888747):1.0964437,Frog:1.0964437):0.0;
I'm new to R and programming and taking a Coursera course. I've asked in their forums, but nobody seems able to provide an answer. To be clear, I'm trying to determine why this produces no output.
When I first wrote the program, I was getting accurate outputs, but after I tried to upload it, something went wonky. Rather than producing any output with [1], [2], etc. when I run the program from RStudio, I only get the blue + continuation prompts, but no errors, and anything I change still does not produce an output.
I tried with a previous version of R, and reinstalled the most recent version 3.2.1 for Windows.
What I've done:
Set the correct working directory through RStudio
pol <- function(directory, pol, id = 1:332) {
    files <- list.files("specdata", full.names = TRUE);
    data <- data.frame();
    for (i in ID) {
        data <- rbind(data, read.csv(files_list[i]))
    }
    subset <- subset(data, ID %in% id);
    polmean <- mean(subset[pol], na.rm = TRUE);
    polmean("specdata", "sulfate", 1:10)
    polmean("specdata", "nitrate", 70:72)
    polmean("specdata", "nitrate", 23)
}
Can someone please provide some direction - debug help?
When I adjust the code, the following errors tend to appear:
ID not found
Missing or unexpected } (although I've matched them all).
The updated code is as follows, if I'm understanding:
data <- data.frame();
files <- files[grepl(".csv", files)]
pollutantmean <- function(directory, pollutant, id = 1:332) {
    pollutantmean <- mean(subset1[[pollutant]], na.rm = TRUE);
}
Looks like you haven't declared what ID is (I assume: a vector of numbers)?
Also, using 'subset' as a variable name while it's also a function, and pol as both a function name and the name of one of the arguments of that same function is just asking for trouble...
And I think there is a missing ")" in your for-loop.
EDIT
So the way I understand it now, you want to do a couple of things.
Read in a bunch of files, which you'll use multiple times without changing them.
Get some mean value out of those files, under different conditions.
Here's how I would do it.
Since you only want to read in the data once, you don't really need a function to do this (you can have one, but I think it's overkill for now). You correctly have code that makes a vector with the file names and then loops over them, rbinding them to each other. The problem is that this can become very slow. Check here. Make sure your directory only contains files that you want to read in, so no R scripts or other stuff. A way (not 100% foolproof) to do this is using files <- files[grepl(".csv",files)], which makes sure you only have the csv's (grepl checks whether a certain string is a substring of another and returns a boolean; the [] then only keeps the elements for which a TRUE was returned). See the sketch below.
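A sketch of the faster pattern alluded to above: read all the files first, then bind once, rather than growing the data frame inside the loop:
files <- list.files("specdata", full.names = TRUE)
files <- files[grepl(".csv", files)]           # keep only the csv's
df <- do.call(rbind, lapply(files, read.csv))  # one rbind instead of many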
Next, there is 'a thing you want to do multiple times', namely getting out mean values. This is where you'd use a function. Apparently you want to get the mean for different types of pollution, restricted to certain IDs.
Let's assume that 1. has given you a dataframe df with a column named Type for the type of pollution and a column called Id that somehow represents a sort of ID (substitute with the actual names in your script - if you don't have a column for ID, I'll edit the answer later on). Now you want a function
polmean <- function(type, id) {
# some code that returns the mean of a restricted version of df
}
This is all you need. You write the code that generates df, you then write a function that will get you what you want from that dataframe, and then you call it for the circumstances you want to use it in (the three polmean calls at the end of your original code, but now without the first argument as you no longer need this).
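A hedged sketch of what that function body might look like, assuming df has the Type and Id columns described above plus a hypothetical measurement column named Value:
polmean <- function(type, id) {
    # restrict df to the requested pollution type and IDs, then average
    # ('Value' is a placeholder for the actual measurement column)
    mean(df$Value[df$Type == type & df$Id %in% id], na.rm = TRUE)
}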
Ok - I finally solved this. Thanks for the help.
I didn't need to call "specdata" in line 2; the directory argument in line 1 referred to the correct directory.
My for/in statement needed to refer to the id in the first line, not the ID in the dataset. The for/in statement doesn't appear to need to be indented (but it looks cleaner that way).
I did not need a subset.
The last 3 lines calling pollutantmean did not need to be part of the program. These are used in the R console to call the results one by one.
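Putting those points together, the corrected function would look roughly like this (a sketch reconstructed from the points above, not necessarily the poster's exact final code):
pollutantmean <- function(directory, pollutant, id = 1:332) {
    files <- list.files(directory, full.names = TRUE)
    data <- data.frame()
    for (i in id) {
        data <- rbind(data, read.csv(files[i]))
    }
    mean(data[[pollutant]], na.rm = TRUE)
}
# then, from the console:
pollutantmean("specdata", "sulfate", 1:10)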
Sorry, this might be too involved a question to ask here. I'm trying to reproduce the Hack Session for the NYTimes Dialect Map Visualisation, located here. I'm OK in the beginning, but then I run into a problem when I try to scrape multiple pages.
To save people from having to reproduce info from the slides, this is what I have so far:
Create URL addresses:
mainURL <- 'http://www4.uwm.edu/FLL/linguistics/dialect/staticmaps/'
stateURL <- 'states.html'
url <- paste0(mainURL, stateURL)
Download and Parse (getURL() is from RCurl, htmlTreeParse() from XML)
library(RCurl)
library(XML)
tmp <- getURL(url)
tmp <- htmlTreeParse(tmp, useInternalNodes = TRUE)
Extract page addresses and save to subURL
subURL <- unlist(xpathSApply(tmp, '//a[@href]', xmlAttrs))
Remove pages that aren't states' names
subURL <- subURL[-(1:4)]
The problem begins for me on slide 24 in the original. The slides say that the next step is to loop over the list of states and read the body of each question. Of course, we also need to save the name of each state in the process. The loop is initialized with the following code:
survey <- vector(length(subURL), mode = "list")
i = 1
stateNames <- rep('', length(subURL))
Underneath this code, the slide says that survey is a list where information about every state is saved. I'm a little puzzled here about how that is the case, since survey is indeed a list with a length of 51, but every element is NULL. I'm also puzzled by what the i is doing here (and this becomes important later). Still, I can follow what the code is doing, and I assumed that the list would get populated later.
It's really the next slide where I get confused. As an example, it is shown how the URL contains the name of each state, using Alaska as an example:
Create URL for the first state and assign to suburl
suburl <- subURL[1]
Remove state_ from suburl
stateName <- gsub('state_','',suburl)
Remove .html from stateName
stateName <- gsub('.html','',stateName)
So far, so good. I can do this for each state individually. However, I can't figure out how to turn this into a loop that would apply to all the states. The slide only has the following code:
stateNames[i] <- stateName
This is where I am stuck. The previous slide assigned 1 to i, so the only thing this does is get the name for Alaska (AK); every other element is "" (as one expects, given how stateNames was defined previously).
I did try the following:
stateNames <- gsub('state_','',subURL)
stateNames <-gsub('.html','',stateNames)
This doesn't quite work, because the length of this vector is 51, but the length of the one shown above is only 1. (Later, I want each state to have its own name, not for all the states to have the same 51 state names.) Moreover, I didn't know what to do with the stateNames[i] <- stateName command.
Anyway, I kept working through to the end (both with the original and the modification), hoping that things would eventually right themselves (and at times I got the same as what was on the presentation), but eventually things just broke. I think there is an additional problem later on in the slides (an object is subsetted that didn't exist before), but I'm guessing it arises from a problem that occurs much earlier.
Anyway, I know this is a pretty involved question, so I apologize if it is inappropriate for this site. I'm just stuck.
I believe I got this to work. See the gist or see here for the solution.
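In outline, the working loop looks something like this (a sketch assembled from the pieces in the question; the exact page parsing is in the gist):
for (i in seq_along(subURL)) {
    suburl <- subURL[i]
    stateName <- gsub('state_', '', suburl)    # strip the prefix
    stateName <- gsub('.html', '', stateName)  # strip the extension
    stateNames[i] <- stateName
    # download and parse this state's page into the corresponding list slot
    survey[[i]] <- htmlTreeParse(getURL(paste0(mainURL, suburl)),
                                 useInternalNodes = TRUE)
}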
I am trying to download an XML file of journal article records and create a dataset for further interrogation in R. I'm completely new to XML and quite a novice at R. I cobbled together some code using bits of code from two sources:
GoogleScholarXScraper
and
Extracting records from pubMed
library(RCurl)
library(XML)
library(stringr)
#Search terms
SearchString<-"cancer+small+cell+non+lung+survival+plastic"
mySearch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",SearchString,"&usehistory=y",sep="",collapse=NULL)
#Search
pub.esearch<-getURL(mySearch)
#Extract QueryKey and WebEnv
pub.esearch<-xmlTreeParse(pub.esearch,asText=TRUE)
key<-as.numeric(xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["QueryKey"]]))
env<-xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["WebEnv"]])
#Fetch Records
myFetch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&WebEnv=",env,"&retmode=xml&query_key=",key)
pub.efetch<-getURL(myFetch)
myxml<-xmlTreeParse(pub.efetch,asText=TRUE,useInternalNodes=TRUE)
#Create dataset of article characteristics #This doesn't work
pub.data<-NULL
pub.data<-data.frame(
journal <- xpathSApply(myxml,"//PubmedArticle/MedlineCitation/MedlineJournalInfo/MedlineTA", xmlValue),
abstract<- xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText",xmlValue),
affiliation<-xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Affiliation", xmlValue),
year<-xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year", xmlValue)
,stringsAsFactors=FALSE)
The main problem I seem to have is that my returned XML file is not completely uniformly structured. For example, some references have a node structure like this:
- <Abstract>
<AbstractText>The Wilms' tumor gene... </AbstractText>
Whilst some have labels and are like this
- <Abstract>
<AbstractText Label="BACKGROUND & AIMS" NlmCategory="OBJECTIVE">Some background text.</AbstractText>
<AbstractText Label="METHODS" NlmCategory="METHODS"> Some text on methods.</AbstractText>
When I extract the 'AbstractText' I am hoping to get 24 rows of data back (there are 24 records when I run this made-up search today), but xpathSApply returns all labels within 'AbstractText' as individual elements of my dataframe. Is there a way to collapse the XML structure in this instance / ignore the labels? Is there a way to make xpathSApply return 'NA' when nothing is found at the end of a path? I am aware of xmlToDataFrame, which sounds like it should fit the bill, but whenever I try to use it, it doesn't seem to give me anything sensible.
Thanks for your help
I am unsure as to which you want, however:
xpathSApply(myxml,"//*/AbstractText[@Label]")
will get the nodes with labels (keeping all attributes etc).
xpathSApply(myxml,"//*/AbstractText[not(@Label)]",xmlValue)
will get the nodes without labels.
EDIT:
test<-xpathApply(myxml,"//*/Abstract",xmlValue)
> length(test)
[1] 24
may give you what you want
EDIT:
To get affiliation, year, etc. padded with NA's:
dumfun <- function(x, xstr){
    # return the matched node value(s), or NA when the path matches nothing
    res <- xpathSApply(x, xstr, xmlValue)
    if(length(res) == 0){
        out <- NA
    } else {
        out <- res
    }
    out
}
xpathSApply(myxml,"//*/Article",dumfun,xstr='./Affiliation')
xpathSApply(myxml,"//*/Article",dumfun,xstr='./Journal/JournalIssue/PubDate/Year')
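Putting the pieces together, the original data.frame could then be built with NA-padded columns along these lines (a sketch; the relative XPaths are taken from the calls above, and MedlineTA sits outside Article, so the journal column would need separate handling):
pub.data <- data.frame(
    abstract    = xpathSApply(myxml, "//*/Article", dumfun, xstr = './Abstract'),
    affiliation = xpathSApply(myxml, "//*/Article", dumfun, xstr = './Affiliation'),
    year        = xpathSApply(myxml, "//*/Article", dumfun, xstr = './Journal/JournalIssue/PubDate/Year'),
    stringsAsFactors = FALSE)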