Error in subset.default using bulk_postcode_lookup - r

Ultimately I want to use postcodes for all state-funded secondary schools in England, but for now I'm trying to figure out what code I will need to use, so using a selection of just 5.
I want to retrieve the coordinates (so latitude and longitude) and the lsoa value for each postcode.
pc_list <- list(postcodes = c("PE7 3BY", "ME15 9AZ", "BS21 6AH", "SG18 8JB", "M11 2NA"))
pclist1 <- bulk_postcode_lookup(pc_list)
This returns all the information about those 5 postcodes. Now I want it just to return information on those 3 variables (latitude, longitude and lsoa) that I'm interested in.
pclist2 <- subset(pclist1, select = c(longitude, latitude, lsoa))
This returns the following error.
Error in subset.default(pclist1, select = c(longitude, latitude, lsoa)) :
argument "subset" is missing, with no default
Once I am able to get this information, I want to add these 3 variables along with their relevant postcode into a new dataframe that I can perform subsequent analysis on - is this what pclist2 will be?

bulk_postcode_lookup() returns a nested list rather than a data frame, which is why subset() dispatches to subset.default() and fails; the result needs to be flattened into a data frame first. Below is a slightly modified example from https://docs.ropensci.org/PostcodesioR/articles/Introduction.html#multiple-postcodes (for whatever reason I only received positive responses once I removed the spaces from the postcodes):
library(PostcodesioR)
library(purrr)
pc_list <- list(postcodes = c("PE73BY", "ME159AZ", "BS216AH", "SG188JB", "M112NA"))
pclist1 <- bulk_postcode_lookup(pc_list)
# extract 2nd list item from each response (the "result" list)
bulk_list <- lapply(pclist1, "[[", 2)
# extract list of items from results lists, return tibble / data frame
bulk_df <- map_dfr(bulk_list, `[`, c("postcode", "longitude", "latitude", "lsoa"))
Resulting tibble / data frame:
bulk_df
#> # A tibble: 5 × 4
#> postcode longitude latitude lsoa
#> <chr> <dbl> <dbl> <chr>
#> 1 PE7 3BY -0.226 52.5 Peterborough 019D
#> 2 ME15 9AZ 0.538 51.3 Maidstone 013C
#> 3 BS21 6AH -2.84 51.4 North Somerset 005A
#> 4 SG18 8JB -0.249 52.1 Central Bedfordshire 006C
#> 5 M11 2NA -2.18 53.5 Manchester 015E
Created on 2023-01-13 with reprex v2.0.2

Related

Web scraping an interactive chart using R

I'm new to web scraping and am trying to scrape the data from this interactive chart using R so that all the series are displayed in a single table: https://www.e61.in/spendtracker
I've used developer tools in chrome (inspect - network - fetch/XHR) but cannot find the data points.
I would be highly appreciative if someone could take a quick look and let me know a) whether the data points are stored on the page somewhere, b) if possible, how they identified the right file, and c) whether it is a reasonably straightforward task to then generate a table.
Continuing from that iframe URL -
before switching to R & rvest you should check the actual page source and perhaps run it through some beautifier. You'll see a Plotly.newPlot() call; check how it receives the array of data series as its 2nd parameter. One option would be to extract that piece of JavaScript with a regex, parse it as JSON and work from there.
Perhaps something like this:
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(jsonlite)
library(purrr)
library(ggplot2)
url <- "https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html"
html <- read_html(url)
# extract the last <script> tag, which contains the Plotly.newPlot() call and the data series
plotly_js <- html %>%
  html_element("script:last-of-type") %>%
  html_text()
# extract the array from the js string, using \Q and \E so the special chars don't all need escaping
p_dataseries <- str_extract(plotly_js, '\\Q[{"connectgaps"\\E.*?\\Q"type":"scatter"}]\\E' )
# parse extracted string
ds_j <- fromJSON(p_dataseries, simplifyVector = FALSE)
# extract data, result will be in long format
df <- map_df(ds_j, `[`, c("name", "x", "y")) %>%
  unnest(c(x, y)) %>%
  mutate(date = as.POSIXct(x))
str(df)
#> tibble [2,346 × 4] (S3: tbl_df/tbl/data.frame)
#> $ name: chr [1:2346] "Total" "Total" "Total" "Total" ...
#> $ x : chr [1:2346] "2020-01-12T00:00:00" "2020-01-19T00:00:00" "2020-01-26T00:00:00" "2020-02-02T00:00:00" ...
#> $ y : num [1:2346] 100 100.1 100.7 99.3 97.8 ...
#> $ date: POSIXct[1:2346], format: "2020-01-12" "2020-01-19" ...
head(df)
#> # A tibble: 6 × 4
#> name x y date
#> <chr> <chr> <dbl> <dttm>
#> 1 Total 2020-01-12T00:00:00 100 2020-01-12 00:00:00
#> 2 Total 2020-01-19T00:00:00 100. 2020-01-19 00:00:00
#> 3 Total 2020-01-26T00:00:00 101. 2020-01-26 00:00:00
#> 4 Total 2020-02-02T00:00:00 99.3 2020-02-02 00:00:00
#> 5 Total 2020-02-09T00:00:00 97.8 2020-02-09 00:00:00
#> 6 Total 2020-02-16T00:00:00 100. 2020-02-16 00:00:00
p <- df %>%
  ggplot(aes(x = date, y = y, color = name)) +
  geom_path() +
  theme_minimal()
p
Created on 2022-09-27 with reprex v2.0.2
You're trying to scrape the wrong URL - the one you've provided uses an iframe with the chart. You should take a close look at the source code of this page instead (the iframe source): https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html

Calculating measure of spatial segregation?

There are five polygons for five different cities (see attached file in the link; it's called bound.shp). I also have a point file "points.csv" with longitude and latitude where, for each point, I know the proportion of people belonging to group m and group h.
I am trying to calculate the spatial segregation proposed by Reardon and O’Sullivan, “Measures of Spatial Segregation”
There is a package called "seg" which should allow us to do it. I am trying to do it but so far no success.
Here is the link to the example file: LINK. After downloading the "example", this is what I do:
setwd("~/example")
library(seg)
library(sf)
bound <- st_read("bound.shp")
points <- st_read("points.csv", options=c("X_POSSIBLE_NAMES=x","Y_POSSIBLE_NAMES=y"))
#I apply the following formula
seg::spseg(bound, points[ ,c(group_m, group_h)] , smoothing = "kernel", sigma = bandwidth)
Error: 'x' must be a numeric matrix with two columns
Can someone help me solve this issue? Or is there an alternate method which I can use?
Thanks a lot.
I don't know exactly what the spseg function does, but looking at the spseg entry in the seg package documentation:
The first argument, x, should be a data frame or an object of class Spatial.
The second argument, data, should be a matrix or data frame.
Going through the Examples for the spseg function, note that data should have the same number of rows as there are ids in the Spatial object. In your sample, the ids are the cities, each with its own polygon.
First, let's examine the bound data:
setwd("~/example")
library(seg)
library(sf)
#For the fortify function
library(ggplot2)
bound <- st_read("bound.shp")
bound <- as_Spatial(bound)
class(bound)
"SpatialPolygonsDataFrame"
attr(,"package")
"sp"
tail(fortify(bound))
Regions defined for each Polygons
long lat order hole piece id group
5379 83.99410 27.17326 972 FALSE 1 5 5.1
5380 83.99583 27.17339 973 FALSE 1 5 5.1
5381 83.99705 27.17430 974 FALSE 1 5 5.1
5382 83.99792 27.17552 975 FALSE 1 5 5.1
5383 83.99810 27.17690 976 FALSE 1 5 5.1
5384 83.99812 27.17700 977 FALSE 1 5 5.1
So you have 5 ids in your SpatialPolygonsDataFrame. Now, let's read points.csv with the read.csv function, since the data needs to be in matrix format for the spseg function.
points <- read.csv("c://Users/cemozen/Downloads/example/points.csv")
tail(points)
group_m group_h x y
950 4.95 78.49000 84.32887 26.81203
951 5.30 86.22167 84.27448 26.76932
952 8.68 77.85333 84.33353 26.80942
953 7.75 82.34000 84.35270 26.82850
954 7.75 82.34000 84.35270 26.82850
955 7.75 82.34000 84.35270 26.82850
In the documentation and the example within, it is clearly stated that the number of rows of the points data (which has two attributes, group_m and group_h, in our case) should equal the number of ids (here, the cities). You could, for example, calculate the mean (or any other statistic) of each variable per polygon, so that you end up with one value per city; a rough sketch of that idea follows.
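A hedged sketch of what that per-polygon aggregation could look like, using sf for the point-in-polygon step. It assumes bound_sf is the polygon layer as read by st_read() (i.e. before the as_Spatial() conversion above), that points has the x, y, group_m and group_h columns shown above, and that every point falls inside exactly one polygon:
library(sf)
# convert the points to sf and record which polygon each point falls in
pts_sf <- st_as_sf(points, coords = c("x", "y"), crs = st_crs(bound_sf))
pts_sf$poly_id <- sapply(st_intersects(pts_sf, bound_sf),
                         function(i) if (length(i) > 0) i[1] else NA_integer_)
# one row of mean proportions per polygon, ordered by polygon index
poly_means <- aggregate(cbind(group_m, group_h) ~ poly_id,
                        data = st_drop_geometry(pts_sf), FUN = mean)
# spseg() then gets one row per polygon
poly_spseg <- spseg(as_Spatial(bound_sf),
                    as.matrix(poly_means[, c("group_m", "group_h")]))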
On the other hand, I just want to show that the function works properly when fed a matrix with 5 rows and 2 groups.
sample_spseg <- spseg(bound, as.matrix(points[1:5,c("group_m", "group_h")]))
print(sample_spseg)
Reardon and O'Sullivan's spatial segregation measures
Dissimilarity (D) : 0.0209283
Relative diversity (R): -0.008781
Information theory (H): -0.0066197
Exposure/Isolation (P):
group_m group_h
group_m 0.07577679 0.9242232
group_h 0.07516285 0.9248372
--
The exposure/isolation matrix should be read horizontally.
Read 'help(spseg)' for more details.
First: I do not have experience with the seg package and its functions.
What I read from your question is that you want to run the spseg function on the points within each area?
If so, here is a possible approach:
library(sf)
library(tidyverse)
library(seg)
library(mapview) # for quick viewing only
# read polygons, make valid to avoid problems later on
areas <- st_read("./temp/example/bound.shp") %>%
  sf::st_make_valid()
# read points and convert to sf object
points <- read.csv("./temp/example/points.csv") %>%
  sf::st_as_sf(coords = c("x", "y"), crs = 4326) %>%
  # spatial join to attach the city each point falls in (using st_join())
  sf::st_join(areas)
# what do we have so far??
mapview::mapview(points, zcol = "city")
# get the coordinates back into a data.frame
mydata <- cbind(points, st_coordinates(points))
# drop the geometry, we do not need it anymore
st_geometry(mydata) <- NULL
# looks like...
head(mydata)
# group_m group_h city X Y
# 1 8.02 84.51 2 84.02780 27.31180
# 2 8.02 84.51 2 84.02780 27.31180
# 3 8.02 84.51 2 84.02780 27.31180
# 4 5.01 84.96 2 84.04308 27.27651
# 5 5.01 84.96 2 84.04622 27.27152
# 6 5.01 84.96 2 84.04622 27.27152
# Split to a list by city
L <- split(mydata, mydata$city)
# loop over the list and run the spseg function per city
final <- lapply(L, function(i) spseg(x = i[, 4:5], data = i[, 1:2]))
# test for the first city
final[[1]]
# Reardon and O'Sullivan's spatial segregation measures
#
# Dissimilarity (D) : 0.0063
# Relative diversity (R): -0.0088
# Information theory (H): -0.0067
# Exposure/Isolation (P):
# group_m group_h
# group_m 0.1160976 0.8839024
# group_h 0.1157357 0.8842643
# --
# The exposure/isolation matrix should be read horizontally.
# Read 'help(spseg)' for more details.
spplot(final[[1]], main = "Equal")

Apply an API Function over 2 columns of Dataframe, Output a Third Column

Here are the first four columns of a dataframe that is 20K long.
# A tibble: 4 x 4
Address Address_Total lon lat
<chr> <dbl> <dbl> <dbl>
1 !500 s. dobson rd., mesa, AZ, 852… 14.1 -112. 33.4
2 # l10, jackson, MS, 39202, United… 16.1 NA NA
3 0 fletcher allen health care, bur… 300 -73.2 44.5
4 00 w main st # 110, babylon, NY, … 287. NA NA
I want to convert the geocodes in the dataframe (lon and lat values) to county codes (FIPS). I found a great script that does that using the FCC API. All you need to do is input a lat/long pair:
geo2fips <- function(latitude, longitude) {
  url <- "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%f&longitude=%f"
  url <- sprintf(url, latitude, longitude)
  json <- RCurl::getURL(url)
  json <- RJSONIO::fromJSON(json)
  as.character(json$County['FIPS'])
}
For instance, if I insert a combo of lat / long it comes up with this, which is perfect:
> # Orange County
> geo2fips(28.35975, -81.421988)
[1] "12095"
What I want to do is use some member of the apply family to run geo2fips over the entire dataset from top to bottom. I would like the output to be a fifth column of my dataframe called "FIPS" or something like that, which just contains the FIPS codes.
Can anyone help? I've been at this for hours and I can't get it to work. I'm sure it's just some syntax issue with the apply family, and I'm pretty sure it's my fault because I'm not passing the dataframe columns correctly.
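A sketch of one possible approach (untested against the FCC API, and df here is just a stand-in name for the 20K-row tibble): mapply() walks geo2fips() over the two columns pairwise, skipping rows where the coordinates are missing so the API is never called with NA values.
df$FIPS <- mapply(function(lat, lon) {
  # skip rows where geocoding failed upstream
  if (is.na(lat) || is.na(lon)) return(NA_character_)
  geo2fips(lat, lon)
}, df$lat, df$lon)
purrr::map2_chr(df$lat, df$lon, ...) would do the same job in tidyverse style, and for 20K rows it may be worth adding a small Sys.sleep() inside the function to stay polite with the API.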

ggmap in R: How do I extract individual location features from geocoding?

I'm trying to clean up user inputted addresses, so I thought using GGMAP to extract the Longitude/Latitude and Address used would be a way to clean everything up. However, the Address it spits out sometimes has colloquial names in the address and it makes it hard to parse out the individual location aspects.
Here's the code I'm using
for (i in 1:nrow(Raw_Address)) {
  result <- try(geocode(Raw_Address$Address_Total[i], output = "more", source = "google"))
  Raw_Address$lon[i] <- as.numeric(result[1])
  Raw_Address$lat[i] <- as.numeric(result[2])
  Raw_Address$geoAddress[i] <- as.character(result[3])
}
I tried changing the "latlona" to "more" and going through the result numbers, but only got back different longitude/latitudes. I didn't see anywhere in the documentation that shows the results vectors.
Basically, I want Street Name, City, State, Zip, Longitude, and Latitude.
Edit: Here's an example of the data
User Input: 1651 SE TIFFANY AVE. PORT ST. LUCIE FL
GGMAP Output: martin health systems - tiffany ave., 1651 se tiffany ave, port st. lucie, fl 34952, usa
This is hard to parse because of the colloquial name. I could use the stringr package to try and parse, but it probably wouldn't be all inclusive. But it returns a distinct address while some users spell "Tiffany" wrong or spell out "Saint" instead of "St."
Rather than using a for loop, purrr::map_dfr will iterate over a vector and rbind the resulting data frames into a single one, which is handy here. For example,
library(tidyverse)
libraries <- tribble(
  ~library, ~address,
  "Library of Congress", "101 Independence Ave SE, Washington, DC 20540",
  "British Library", "96 Euston Rd, London NW1 2DB, UK",
  "New York Public Library", "476 5th Ave, New York, NY 10018",
  "Library and Archives Canada", "395 Wellington St, Ottawa, ON K1A 0N4, Canada"
)
library_locations <- map_dfr(libraries$address, ggmap::geocode,
                             output = "more", source = "dsk")
This will output a lot of messages, some telling you what geocode is calling, e.g.
#> Information from URL : http://www.datasciencetoolkit.org/maps/api/geocode/json?address=101%20Independence%20Ave%20SE,%20Washington,%20DC%2020540&sensor=false
and some warning that factors are being coerced to character:
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
which they should be, so you can ignore them all. (If you really want you can write more code to make them go away, but you'll end up with the same thing.)
Combine the resulting data frames, and you get all the location data linked to your original dataset:
full_join(libraries, library_locations)
#> Joining, by = "address"
#> # A tibble: 4 x 15
#> library address lon lat type loctype north south east west
#> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Librar… 101 In… -77.0 38.9 stre… rooftop 38.9 38.9 -77.0 -77.0
#> 2 Britis… 96 Eus… -0.125 51.5 stre… rooftop 51.5 51.5 -0.124 -0.126
#> 3 New Yo… 476 5t… -74.0 40.8 stre… rooftop 40.8 40.8 -74.0 -74.0
#> 4 Librar… 395 We… -114. 60.1 coun… approx… 83.1 41.7 -52.3 -141.
#> # … with 5 more variables: street_number <chr>, route <chr>,
#> # locality <chr>, administrative_area_level_1 <chr>, country <chr>
You may notice that Data Science Toolkit has utterly failed to geocode Library and Archives Canada; for whatever reason it's marked as a country instead of an address. Geocoders are faulty sometimes. From here, subset out whatever you don't need.
If you want even more information, you can use geocode's output = "all" method, but that returns a list you'll need to parse, which takes more work.
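A rough sketch of digging into output = "all" (the nested element names below follow the Google Geocoding API response, so treat them as an assumption and adjust to whatever your ggmap version actually returns):
raw <- ggmap::geocode("476 5th Ave, New York, NY 10018",
                      output = "all", source = "google")
# each address component carries long_name, short_name and a types vector
components <- raw$results[[1]]$address_components
get_component <- function(type) {
  hit <- purrr::keep(components, ~ type %in% .x$types)
  if (length(hit) == 0) NA_character_ else hit[[1]]$long_name
}
get_component("postal_code")                 # zip
get_component("administrative_area_level_1") # state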

R remove duplicated values based on a column and replace other column values with the mean of duplicated rows

I'm working with a data.frame which has 6 environmental variables of interest which are georeferenced by location. The problem I have is that some of the locations are duplicated but all the environmental variables are unique measurements.
Unfortunately, the modelling I want to do with these data will not work if there are duplicate locations. But I do not wish to arbitrarily throw away data by keeping only one duplicated row.
So, I'm looking for a method of taking the means for each of the 6 variables for each set of duplicates and then ascribing that mean to each variable and the location thereby preserving the information from the multiple measurements.
I've attempted this in several ways but I can't quite seem to get it right!
The data I'm working with can be downloaded here:
(https://www.dropbox.com/sh/xnwp3zz5abnilyo/AABRVJZ0kTmWk0T9Fcp4-bVSa?dl=0/)
This is how I've attempted this :
library(rgdal)
library(sp)
library(maptools)
#load data
hs1 <- readOGR(".", "Hollicombe_S1_L1-5_A1.2")
#remove columns we're not interested in
hs1 <- subset(hs1, select = -c(1:16, 23:24))
So I start with hs1 - a SPDF with 552 obs and 6 variables...
#check for duplicate location (present if lengths differ)
length(hs1@coords)
[1] 1104
length(unique(hs1@coords))
[1] 730
#duplicates confirmed
hs1.d <- hs1[duplicated(hs1@coords), ] # creates new SPDF with only duplicated locations (?)
hs1.u <- hs1[!duplicated(hs1@coords), ] # creates new SPDF with only unique locations
# coerce duplicated locations SPDF to an ordinary data frame
hs1.md<- as.data.frame(hs1.d)
# combine the X&Y into a single "location"
hs1.md <- within(hs1.md,
                 Location <- paste(coords.x1, coords.x2, sep = ","))
# aggregate duplicate locations and calculate a mean value for each
means_by_location <- aggregate(cbind(BioArea, BioVolume, MeanBioHei, MaxBioheig, PerArIn, PerVolIn) ~ Location, hs1.md, mean)
#split location back to X&Y
lat_long <- strsplit(means_by_location$Location, ",")
means_by_location$coords.x1 <- sapply(lat_long, function(x) x[1]) #adds X data back
means_by_location$coords.x2 <- sapply(lat_long, function(x) x[2])#adds Y data back
means_by_location$coords.x1 <- as.numeric (means_by_location$coords.x1) #converts to numeric
means_by_location$coords.x2 <- as.numeric (means_by_location$coords.x2)#converts to numeric
# add spatial information back in to create SPDF
coordinates(means_by_location) = ~coords.x1+coords.x2 # adds the locations
proj4string(means_by_location) = CRS(proj4string(hs1)) # sets the CRS
# hs1.md as SPDF containing single rows for previously duplicated locations
# with mean values for each variable
hs1.md <- subset(means_by_location, select = -(1))
#merge hs1.md and hs1.u to create new SPDF without duplicates
hs1 <- spRbind (hs1.u, hs1.md)
So hs1 is now a SPDF with 543 obs (i.e. 9 observations have been removed).
But duplicate locations still remain, and the number of unique locations is unchanged:
length(hs1@coords) # total number of locations
[1] 1086
length(unique(hs1@coords)) #number of unique locations
[1] 730
I suspect I've incorrectly separated the unique from the duplicated observations somewhere, but my knowledge of R is not sufficient for me to spot this. Can anyone see where I have gone wrong? Or does anybody know an alternative way I can achieve this?
As per my comment, the answer to this is a bit tricky as what's considered a duplicate is probably dependent on accuracy.
On loading your shapefile I saw each measurement is a line, with an origin, end, and centre. The centre seemed to match the coordinates given in the shapefile.
Assuming the centres are in fact the coordinates, I would use the new dplyr verbs in the sf package:
library("tidyverse")
library("sf")
hs1 = read_sf(".", "Hollicombe_S1_L1-5_A1")
nrow(hs1)
# 552
nrow(hs1[duplicated(hs1$geometry), ])
# 187
So we have 552 cases with 187 duplicates (i.e. 365 locations). To obtain the mean for duplicated locations use group_by() and summarise():
hs1 = hs1 %>%
  group_by(CentrePos1, CentrePos_) %>%
  summarise(
    BioArea = mean(BioArea),
    BioVolume = mean(BioVolume),
    MeanBioHei = mean(MeanBioHei),
    MaxBioheig = mean(MaxBioheig),
    PerArIn = mean(PerArIn),
    PerVolIn = mean(PerVolIn)
  )
hs1
# Simple feature collection with 365 features and 8 fields
# geometry type: POINT
# dimension: XY
# bbox: xmin: -3.548833 ymin: 50.44483 xmax: -3.542333 ymax: 50.45167
# epsg (SRID): 4326
# proj4string: +proj=longlat +datum=WGS84 +no_defs
# A tibble: 365 x 9
# Groups: CentrePos1 [59]
# CentrePos1 CentrePos_ BioArea BioVolume MeanBioHei MaxBioheig PerArIn PerVolIn geometry
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <simple_feature>
# 1 -3.548833 50.44500 0.00000 0.00000 0.192 0.216 -1.000 -1.000 <POINT (-3.54...>
# 2 -3.548833 50.44533 2.27280 0.41470 0.182 0.264 91.410 2.810 <POINT (-3.54...>
# 3 -3.548744 50.44500 6.75470 1.21780 0.180 0.216 74.890 2.210 <POINT (-3.54...>
# 4 -3.548667 50.44506 5.02900 1.14660 0.228 0.228 100.000 3.720 <POINT (-3.54...>
# 5 -3.548667 50.44517 8.24895 1.86555 0.225 0.330 96.550 3.530 <POINT (-3.54...>
# 6 -3.548667 50.44532 10.31200 2.04180 0.198 0.204 100.000 3.210 <POINT (-3.54...>
# 7 -3.548667 50.44536 18.61980 3.67040 0.197 0.276 100.000 3.280 <POINT (-3.54...>
# 8 -3.548667 50.44550 3.31670 0.73700 0.222 0.300 96.150 3.550 <POINT (-3.54...>
# 9 -3.548500 50.44533 6.22370 1.74670 0.269 0.372 81.555 3.470 <POINT (-3.54...>
# 10 -3.548500 50.44550 6.00740 1.00090 0.168 0.234 80.905 2.215 <POINT (-3.54...>
# ... with 355 more rows
You can see there are 365 rows, and no duplicates:
any(duplicated(hs1$geometry))
# FALSE
The new columns hold the mean values based on the grouping we performed earlier. If an observation's location was unique, its original value was returned (well, its original value divided by 1, I suppose).
I should point out that sf is replacing sp, rgdal, and rgeos in R, but if you do want to continue using those packages you can convert your sf object into a SpatialPointsDataFrame with as_Spatial():
hs1_data = st_set_geometry(hs1, NULL)
hs1 = as_Spatial(hs1$geometry)
hs1 = SpatialPointsDataFrame(hs1, hs1_data)
