I have a data frame containing coordinates to various locations that I'd like to use with Google Earth. Here's a simple example showing the structure:
data <- data.frame(country = "USA", city = "Saint Paul",
lat = 44.9629, lon = -93.00146)
I followed this SO post and this guide and successfully created KML output using the writeOGR() function from the rgdal package, but I'm having trouble tweaking the attributes. Here's the code:
# you may need to install gdal itself for the package to install successfully
# install.packages("rgdal")
library(rgdal)
data_sp <- data
coordinates(data_sp) <- c("lon", "lat")
proj4string(data_sp) <- CRS("+init=epsg:4238")
data_ll <- spTransform(data_sp, CRS("+proj=longlat +datum=WGS84"))
writeOGR(data_ll["city"], "/path/to/test.kml", driver = "KML", layer = "city")
The result works fine for just viewing locations, but I'd like to change the <styleUrl> element and have the <name> element populated. Without a <name>, Google Earth labels each location [no name].
Here's the resultant .kml file:
<?xml version="1.0" encoding="utf-8" ?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder><name>city</name>
<Placemark>
<ExtendedData><SchemaData schemaUrl="#city">
<SimpleData name="city">Saint Paul</SimpleData>
</SchemaData></ExtendedData>
<Point><coordinates>-93.001753817020003,44.96282130428127</coordinates></Point>
</Placemark>
</Folder>
<Schema name="city" id="city">
<SimpleField name="city" type="string"></SimpleField>
</Schema>
</Document></kml>
I need to either get a <name> element to populate with the SimpleField name="city" contents, or have <name>City</name> tags added to each <Placemark>. What I'd like is something like this as the final result (note added <Style> definition, <styleUrl> attribute for the <Placemark>, and <name> attribute added):
<?xml version="1.0" encoding="utf-8" ?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Style id="custom">
<IconStyle>
<scale>1.5</scale>
<Icon>
<href>http://upload.wikimedia.org/wikipedia/commons/a/af/Tux.png</href>
</Icon>
</IconStyle>
</Style>
<Folder><name>city</name>
<Placemark>
<name>Saint Paul</name>
<styleUrl>#custom</styleUrl>
<ExtendedData><SchemaData schemaUrl="#city">
<SimpleData name="city">Saint Paul</SimpleData>
</SchemaData></ExtendedData>
<Point><coordinates>-93.001753817020003,44.96282130428127</coordinates></Point>
</Placemark>
</Folder>
<Schema name="city" id="city">
<SimpleField name="city" type="string"></SimpleField>
</Schema>
</Document></kml>
Here's what the result looks like (similar to what I'm aiming for):
The rgdal documentation mentions a layer_options attribute, but nothing intuitively stuck out to me...
layer_options = c("<name>????</name>")?
layer_options = c("<styleUrl>#custom</styleUrl>")?
Something else?
The attempts above to pass a tag directly don't appear to affect the output.
I didn't find many examples when googling, other than creating the default output from writeOGR() as shown above. Thanks for any suggestions.
To expand on @jlhoward's answer, I was able to use kmlPoints() from the maptools package to accomplish what I was looking for:
data <- data.frame(country = "USA", city = "Saint Paul",
lat = 44.9629, lon = -93.00146)
# you may need to install gdal itself for the package to install successfully
# install.packages("rgdal")
library(rgdal)
library(maptools)
data_sp <- data
coordinates(data_sp) <- c("lon", "lat")
proj4string(data_sp) <- CRS("+init=epsg:4238")
data_ll <- spTransform(data_sp, CRS("+proj=longlat +datum=WGS84"))
kmlPoints(data_ll["city"], kmlfile = "~/Desktop/test.kml",
name = data_ll$city,
icon = "http://upload.wikimedia.org/wikipedia/commons/a/af/Tux.png")
The output contains both the desired <name> attribute as well as a <Style> definition for the custom icon, which is applied successfully to the <Placemark> entries:
readLines("~/Desktop/test.kml")
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
[2] "<kml xmlns=\"http://earth.google.com/kml/2.2\">"
[3] "<Document>"
[4] "<name></name>"
[5] "<description><![CDATA[]]></description>"
[6] ""
[7] "<Style id=\"style1\">"
[8] " <IconStyle>"
[9] " <Icon>"
[10] " <href>http://upload.wikimedia.org/wikipedia/commons/a/af/Tux.png</href>"
[11] " </Icon>"
[12] " </IconStyle>"
[13] "</Style>"
[14] ""
[15] "<Placemark>"
[16] " <name>Saint Paul</name>"
[17] " <description><![CDATA[]]></description>"
[18] " <styleUrl>#style1</styleUrl>"
[19] " <Point>"
[20] " <coordinates>"
[21] "-93.00175381702,44.9628213042813"
[22] " </coordinates>"
[23] " </Point>"
[24] "</Placemark>"
[25] "</Document>"
[26] "</kml>"
Well, if all you want to do is populate the <name> element in each <Placemark>, this will do it:
library(maptools)
kmlPoints(data_ll,"test.kml",name=data$city)
readLines("test.kml")
# [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
# [2] "<kml xmlns=\"http://earth.google.com/kml/2.2\">"
# [3] "<Document>"
# [4] "<name></name>"
# ...
# [15] "<Placemark>"
# [16] " <name>Saint Paul</name>"
# [17] " <description><![CDATA[]]></description>"
# [18] " <styleUrl>#style1</styleUrl>"
# [19] " <Point>"
# [20] " <coordinates>"
# [21] "-93.00175381702,44.9628213042813"
# [22] " </coordinates>"
# [23] " </Point>"
# [24] "</Placemark>"
# [25] "</Document>"
# [26] "</kml>"
If you need to change the <Style> as well, then I'm afraid you may have to hack the kml file using the XML package.
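For what it's worth, here is a minimal sketch of that kind of post-processing. It uses xml2 rather than the XML package (my substitution), and it assumes the KML has the structure that writeOGR() produced in the question. The custom style id and the icon URL are copied from the desired output there; the inline string stands in for the file, and the output path in the last comment is hypothetical.

```r
library(xml2)

# Trimmed copy of the writeOGR() output from the question
kml_text <- '<?xml version="1.0" encoding="utf-8" ?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder><name>city</name>
<Placemark>
<ExtendedData><SchemaData schemaUrl="#city">
<SimpleData name="city">Saint Paul</SimpleData>
</SchemaData></ExtendedData>
<Point><coordinates>-93.001753817020003,44.96282130428127</coordinates></Point>
</Placemark>
</Folder>
</Document></kml>'

doc <- read_xml(kml_text)
ns  <- xml_ns(doc)  # the default KML namespace is mapped to the prefix d1

# Add the <Style> definition as the first child of <Document>
document   <- xml_find_first(doc, "//d1:Document", ns)
style      <- xml_add_child(document, "Style", id = "custom", .where = 0)
icon_style <- xml_add_child(style, "IconStyle")
xml_add_child(icon_style, "scale", "1.5")
icon <- xml_add_child(icon_style, "Icon")
xml_add_child(icon, "href", "http://upload.wikimedia.org/wikipedia/commons/a/af/Tux.png")

# Give every <Placemark> a <name> (copied from its city field) and a <styleUrl>
placemarks <- xml_find_all(doc, "//d1:Placemark", ns)
for (i in seq_along(placemarks)) {
  pm   <- placemarks[[i]]
  city <- xml_text(xml_find_first(pm, ".//d1:SimpleData[@name='city']", ns))
  xml_add_child(pm, "styleUrl", "#custom", .where = 0)
  xml_add_child(pm, "name", city, .where = 0)  # inserted first, so it precedes <styleUrl>
}

cat(as.character(doc))
# write_xml(doc, "test_styled.kml")  # hypothetical output path
```

With a real file you would pass the .kml path to read_xml() instead of the string. The newly added nodes are created without an explicit namespace, so they only pick up the KML default namespace once the document is written out and read back in.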
I am using R to try to download images from the Reptile Database by filling in its search form to look for specific images. For that, I am following previous suggestions for filling in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
url = "http://reptile-database.reptarium.cz/advanced_search",
encode = "json",
body = list(
genus = "Chamaeleo",
species = "dilepis"
)) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is identifying the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
That page contains names with links, so I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, in the generated out object.
Here I only extract the links to the pictures. To download them, simply map or apply a function that calls download.file() over the result.
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
"http://reptile-database.reptarium.cz/species?genus=", genus,
"&species=", species) %>%
read_html() %>%
html_elements("#gallery img") %>%
html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
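As a rough sketch of that download step: the two URLs below are copied from the output above and stand in for the full pics vector, while the target folder and the download_pics() helper are names I made up for illustration. The actual download call is left commented out so the snippet runs without network access.

```r
# First two thumbnail URLs from the output above, standing in for `pics`
pics <- c(
  "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg",
  "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
)

# Hypothetical target folder; one local file per URL, named after the remote file
out_dir <- file.path(tempdir(), "reptile_pics")
dir.create(out_dir, showWarnings = FALSE)
dest <- file.path(out_dir, basename(pics))

# Hypothetical helper: download each URL to its matching local path.
# mode = "wb" keeps the binary image data intact on Windows.
download_pics <- function(urls, files) {
  invisible(Map(function(u, f) download.file(u, f, mode = "wb"), urls, files))
}

# download_pics(pics, dest)  # uncomment to fetch the files
```

Naming the local files with basename() assumes the remote file names are unique. If two URLs ended in the same file name you would need to build distinct names yourself.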
I have an XML page (the UniProt entry for P09429) that I need to parse using xml2.
However, with this code I cannot get the list under the subcellularLocation XPath:
library(xml2)
xmlfile <- "https://www.uniprot.org/uniprot/P09429.xml"
doc <- read_xml(xmlfile)
xml_name(doc)
xml_children(doc)
x <- xml_find_all(doc, "//subcellularLocation")
xml_path(x)
# character(0)
What is the right way to do it?
Update
The desired output is a vector:
[1] "Nucleus"
[2] "Chromosome"
[3] "Cytoplasm"
[4] "Secreted"
[5] "Cell membrane"
[6] "Peripheral membrane protein"
[7] "Extracellular side"
[8] "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
Use x <- xml_find_all(doc, "//d1:subcellularLocation").
Whenever you run into a problem like this, the first thing to do is check the documentation. Run ?xml_find_all and you will see this at the end of the page:
# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
<root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
<f:doc><g:baz /></f:doc>
<f:doc><g:baz /></f:doc>
</root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))
So you then go to check xml_ns(doc) and find
d1 <-> http://uniprot.org/uniprot
xsi <-> http://www.w3.org/2001/XMLSchema-instance
Update
library(magrittr) # provides %>%
xml_find_all(doc, "//d1:subcellularLocation") %>%
xml_children() %>%
xml_text()
## [1] "Nucleus"
## [2] "Chromosome"
## [3] "Cytoplasm"
## [4] "Secreted"
## [5] "Cell membrane"
## [6] "Peripheral membrane protein"
## [7] "Extracellular side"
## [8] "Endosome"
## [9] "Endoplasmic reticulum-Golgi intermediate compartment"
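To see the namespace effect without hitting the UniProt server, here is a small self-contained sketch. The inline document is a made-up fragment that only mimics the relevant part of the real file (same default namespace URL, two locations); it is not the actual entry.

```r
library(xml2)

# Made-up fragment using the same default namespace as the UniProt file
doc <- read_xml('<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <comment type="subcellular location">
      <subcellularLocation>
        <location>Nucleus</location>
        <location>Chromosome</location>
      </subcellularLocation>
    </comment>
  </entry>
</uniprot>')

# Without a prefix the XPath matches nothing, because every element
# lives in the default namespace
n_plain <- length(xml_find_all(doc, "//subcellularLocation"))

# xml_ns() maps the default namespace to d1, so this finds the nodes
locs <- xml_text(xml_find_all(doc, "//d1:subcellularLocation/d1:location"))

# Alternative: strip the default namespace (this modifies doc in place),
# after which the unprefixed XPath works
xml_ns_strip(doc)
locs_stripped <- xml_text(xml_find_all(doc, "//subcellularLocation/location"))
```

Both locs and locs_stripped should hold the two location names. Which route is nicer probably depends on whether you need the namespace later, since xml_ns_strip() changes the document itself.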
If you don't mind, you can use the rvest package:
library(rvest)
a=read_html(xmlfile)%>%
html_nodes("subcellularlocation")
a%>%html_children()%>%html_text()
[1] "Nucleus" "Chromosome"
[3] "Cytoplasm" "Secreted"
[5] "Cell membrane" "Peripheral membrane protein"
[7] "Extracellular side" "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table() function.
Note that there are five tables on the page, so you will need to specify which table you would like to parse.
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
# find all of the tables on the page
tables <- html_nodes(statepop, "table")
# convert the first table into a data frame
table1 <- html_table(tables[[1]])
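Here is an offline sketch of the same idea, in case the Wikipedia page changes. The inline HTML is a made-up two-table page (the populations are the California and Texas figures from the question's output), and the comma-stripping step at the end is my addition, since the numbers arrive as text.

```r
library(rvest)

# Made-up page with two tables; only the second one holds the data we want
page <- read_html('<html><body>
<table><tr><th>Note</th></tr><tr><td>not the one we want</td></tr></table>
<table>
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>California</td><td>39,250,017</td></tr>
  <tr><td>Texas</td><td>27,862,596</td></tr>
</table>
</body></html>')

# Find all tables, then pick one by position with [[ ]]
tables <- html_nodes(page, "table")
states <- html_table(tables[[2]])

# The thousands separators keep the column as text, so strip them before converting
states$Population <- as.numeric(gsub(",", "", states$Population))
states
```

Using [[2]] rather than [2] passes a single table node, so html_table() gives back one data frame instead of a list of them.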
Using my trusty firebug and firepath plug-ins I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4 thusly:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
Any ideas what I've done wrong? Many thanks.
First off, I can't find that first sectional time of 29.4. The one I see on the page you linked is 24.5, unless I'm misunderstanding what you are looking for.
Here's a way of grabbing that one using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
html_text(trim = T)
> t
[1] "24.5"
This differs a bit from your approach but I hope it helps. Not sure how to properly scrape the meeting time that way, but this at least works:
mt <- html %>%
html_nodes("font > table font") %>%
html_text(trim = T)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"
Looks like the comments just after the <a> may be throwing you off.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something a little less fragile than that long direct path.
Update
If you are after all of the times in the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
tree,
"//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
xmlValue
)
This looks for the "1st Sec." column, jumps up to its row, then grabs the first td value from every other following row.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
I've removed all the extra whitespace (\r\n\t\t...) for display purposes here.
If you wanted to make it a little more dynamic, you could grab the column value under "1st Sec." or any other column. Replace
/td[1]
with
/td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using that, you could update the name of the column, and grab the corresponding values. For all "3rd Sec." times:
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "
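The column-lookup trick can be tried on a toy table without touching the live page. In this sketch the HTML and the column_values() helper are made up for illustration (the "2nd Sec." numbers are invented), and because the toy table has no interleaved rows I dropped the position() mod 2 filter.

```r
library(XML)

# Toy table: a header row followed by plain data rows
html <- '<table>
<tr><td>1st Sec.</td><td>2nd Sec.</td><td>3rd Sec.</td></tr>
<tr><td>29.4</td><td>22.9</td><td>23.3</td></tr>
<tr><td>28.7</td><td>23.1</td><td>23.7</td></tr>
</table>'
tree <- htmlParse(html, asText = TRUE)

# Hypothetical helper: find the header cell by its text, count the cells
# before it to get the column index, then pull that cell from each later row
column_values <- function(tree, header) {
  cell <- sprintf("//tr/td[starts-with(., '%s')]", header)
  path <- sprintf(
    "%s/../following-sibling::tr/td[count(%s/preceding-sibling::*) + 1]",
    cell, cell
  )
  xpathSApply(tree, path, xmlValue)
}

first_sec <- column_values(tree, "1st Sec.")
third_sec <- column_values(tree, "3rd Sec.")
```

Here first_sec should come back as the values under "1st Sec." and third_sec as those under "3rd Sec.", so switching columns only means changing the header text.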
Probably a simple question. I have looked at the many options in scan() but haven't got what I want yet.
A simple example would be
require(httr)
example <- content(GET("http://www.r-project.org"), as = 'text')
write(example, 'text.txt')
input <- readLines('text.txt')
> example
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n<html>\n<head>\n<title>The R Project for Statistical Computing</title>\n<link rel=\"icon\" href=\"favicon.ico\" type=\"image/x-icon\">\n<link rel=\"shortcut icon\" href=\"favicon.ico\" type=\"image/x-icon\">\n<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">\n</head>\n\n<FRAMESET cols=\"1*, 4*\" border=0>\n<FRAMESET rows=\"120, 1*\">\n<FRAME src=\"logo.html\" name=\"logo\" frameborder=0>\n<FRAME src=\"navbar.html\" name=\"contents\" frameborder=0>\n</FRAMESET>\n<FRAME src=\"main.shtml\" name=\"banner\" frameborder=0>\n<noframes>\n<h1>The R Project for Statistical Computing</h1>\n\nYour browser seems not to support frames,\nhere is the contents page of the R Project's\nwebsite.\n</noframes>\n</FRAMESET>\n\n\n\n"
input
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
[2] "<html>"
[3] "<head>"
[4] "<title>The R Project for Statistical Computing</title>"
[5] "<link rel=\"icon\" href=\"favicon.ico\" type=\"image/x-icon\">"
[6] "<link rel=\"shortcut icon\" href=\"favicon.ico\" type=\"image/x-icon\">"
[7] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
[8] "</head>"
[9] ""
[10] "<FRAMESET cols=\"1*, 4*\" border=0>"
[11] "<FRAMESET rows=\"120, 1*\">"
[12] "<FRAME src=\"logo.html\" name=\"logo\" frameborder=0>"
[13] "<FRAME src=\"navbar.html\" name=\"contents\" frameborder=0>"
[14] "</FRAMESET>"
[15] "<FRAME src=\"main.shtml\" name=\"banner\" frameborder=0>"
[16] "<noframes>"
[17] "<h1>The R Project for Statistical Computing</h1>"
[18] ""
[19] "Your browser seems not to support frames,"
[20] "here is the contents page of the R Project's"
[21] "website."
[22] "</noframes>"
[23] "</FRAMESET>"
[24] ""
[25] ""
[26] ""
[27] ""
The motivation for this is that I want to store various files in PostgreSQL, and I am passing them in the format given by example as opposed to input. Apologies if I haven't explained this very well.
@Hong Ooi gave a nice answer using readChar. I have encoding issues, so I have had to wrap it in iconv()
iconv(readChar(file, nchars=file.info(file)["size"], TRUE), from = "latin1", to = "UTF-8")
to stop the database complaining.
If you want all those strings concatenated into a single string:
paste(input, collapse="\n")
Alternatively, if you're reading from a file and want to avoid splitting the input into bits and putting them back together:
f <- readChar(file, nchars=file.info(file)["size"], TRUE)
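A quick way to compare the two approaches is a throwaway file. This is only a sketch: the four-line content and the temp file are made up, and I use file.info(tmp)$size to get the size as a plain number.

```r
html_lines <- c("<html>", "<head>", "</head>", "</html>")
tmp <- tempfile(fileext = ".txt")

# Write in binary mode so line endings are "\n" on every platform
con <- file(tmp, "wb")
writeLines(html_lines, con, sep = "\n")
close(con)

# Approach 1: read line by line, then glue the pieces back together
joined <- paste(readLines(tmp), collapse = "\n")

# Approach 2: read the whole file as one string in a single call
whole <- readChar(tmp, nchars = file.info(tmp)$size, useBytes = TRUE)

# The only difference should be the newline that ends the file
same <- identical(whole, paste0(joined, "\n"))
```

So readChar() keeps the file byte for byte, including the final newline, while the paste() version drops it. That may or may not matter for what you store in the database.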