Hi, I need to write a function to calculate R1, which is defined as follows:
R1 = 1 - (F(h) - h^2/(2N))
where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequency up to that point. Using the quanteda package I managed to calculate the Hirsch point:
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", "The US president accused Palestinians of lacking “appreciation or respect”.")
a3 <- c("We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.", "After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
txt <- list(a, a1, a2, a3)
To create my data I had to chunk each text in an increasing manner. Therefore, the input is a list of chunked texts within another list.
library(quanteda)
DFMs <- lapply(txt, dfm)
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
s_p <- function(x) split(x, x$group)
tstat_by <- lapply(Fs, s_p)
h_values <- lapply(tstat_by, vapply, get_h_point, double(1))
To calculate F(h), the cumulative relative frequency up to the h-point, for use in R1, I need two values: one from Fs$rank and the other from h_values. Consider the first original texts (tstat_by[[1]], tstat_by[[2]], and tstat_by[[3]]) and their respective h_values (h_values[[1]], h_values[[2]], and h_values[[3]]):
fh_txt1 <- tail(prop.table(cumsum(tstat_by[[1]][["text1"]]$rank:h_values[[1]][["text1"]])), n = 1)
fh_txt2 <- tail(prop.table(cumsum(tstat_by[[1]][["text2"]]$rank:h_values[[1]][["text2"]])), n = 1)
...
tail(prop.table(cumsum(tstat_by[[4]][["text2"]]$rank:h_values[[4]][["text2"]])), n=1)
[1] 1
tail(prop.table(cumsum(tstat_by[[4]][["text3"]]$rank:h_values[[4]][["text3"]])), n=1)
[1] 0.75
As you can see, the grouping is the same: the docnames for each chunk of the original character vectors are the same (text1, text2, text3, etc.). My question is how to write a function for fh_txt(s) so that lapply can be used to calculate F(h) for R1.
Please note that the goal is to write a function to calculate R1, and what I've put here is what has been done in this regard.
I've simplified your inputs below, and used the groups argument in textstat_frequency() instead of your approach to creating lists of dfm objects.
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.")
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfmat <- c(a, a1, a2) %>%
  tokens() %>%
  dfm()
tstat <- quanteda.textstats::textstat_frequency(dfmat, groups = docnames(dfmat), ties_method = "first")
tstat_by <- split(tstat, tstat$group)
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
h_values <- vapply(tstat_by, get_h_point, double(1))
h_values
## text1 text2 text3
## 2.000014 1.500000 2.000024
tstat_by <- lapply(
  names(tstat_by),
  function(x) subset(tstat_by[[x]], cumsum(rank) <= h_values[[x]])
)
do.call(rbind, tstat_by)
## feature frequency rank docfreq group
## 1 the 2 1 1 text1
## 29 the 2 1 1 text2
## 48 the 3 1 1 text3
You didn't specify what output you wanted, but with this result you should be able to compute it yourself, either on the list using lapply() or on the combined data.frame using, for instance, dplyr.
Created on 2021-04-05 by the reprex package (v1.0.0)
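As a follow-up sketch of my own (untested, not part of the reprex above): with the per-group frequency tables and h_values in hand, F(h) and R1 can be computed per group with a small helper. This assumes F(h) is the cumulative relative frequency of all features ranked at or below h, and N is the group's total token count; get_r1 is an illustrative name, not a quanteda function.
freqs_by <- split(tstat, tstat$group)  # unsubset per-group frequency tables

get_r1 <- function(freqs, h) {
  N  <- sum(freqs$frequency)                       # N: total tokens in the group
  Fh <- sum(freqs$frequency[freqs$rank <= h]) / N  # F(h): cum. rel. freq. up to h
  1 - (Fh - h^2 / (2 * N))                         # R1 = 1 - (F(h) - h^2/(2N))
}

r1_values <- vapply(
  names(freqs_by),
  function(g) get_r1(freqs_by[[g]], h_values[[g]]),
  double(1)
)
r1_values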
I am trying to use the very well-written instructions from this blog: https://www.jessesadler.com/post/geocoding-with-r/ to geocode locational data in R, including specific sites and cities in Hawaii. I am having issues pulling information from Google. When running mutate_geocode my data runs, but no output is gathered. For the time being I bypassed this by manually entering lat and lon for just one location of my dataset, attempting to troubleshoot. Now, when I use get_googlemap, I get the error message "Error in Download File".
I have tried using mutate_geocode as well as running a loop with geocode. I either get no output or I get the OVER_QUERY_LIMIT error (which seems to be a classic). After checking my query limit, I am nowhere near it.
Method 1:
library(dplyr)
library(ggmap)
BH <- rename(location, place = Location)
BH_df <- as.data.frame(BH)
location_df <- mutate_geocode(BH_df, place)
Method 2:
origAddress <- read.csv("HSMBH.csv", stringsAsFactors = FALSE)
for (i in 1:nrow(origAddress)) {
  result <- geocode(origAddress$Location[i], output = "latlona", source = "google")
  origAddress$lon[i] <- as.numeric(result[1])
  origAddress$lat[i] <- as.numeric(result[2])
  origAddress$geoAddress[i] <- as.character(result[3])
}
After manually entering lon and lat points, I run into this error when calling:
map <- get_googlemap(center = c(-158.114, 21.59), zoom = 4)
I am hoping to gather lat and lon points for my locations, and then to use get_googlemap to draft a map on which I can plot density points of occurrences (I already have the code for the points).
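Note that recent versions of ggmap require a billing-enabled Google API key to be registered in the session, and a missing key can produce both silent geocoding failures and the "Error in Download File" message. A minimal sketch, assuming ggmap >= 2.7 and a valid key (placeholder shown):
library(ggmap)
register_google(key = "YOUR_API_KEY")  # placeholder; billing must be enabled

# with the key registered, both calls from the question should work:
location_df <- mutate_geocode(BH_df, place)
map <- get_googlemap(center = c(-158.114, 21.59), zoom = 4)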
Alternatively, you can use a one-liner for rapid geocoding via tmaptools::geocode_OSM():
Data
library(tmaptools)
addresses <- data.frame(address = c("New York", "Berlin", "Huangpu Qu",
"Vienna", "St. Petersburg"),
stringsAsFactors = FALSE)
Code
result <- lapply(addresses, geocode_OSM)
> result
$address
query lat lon lat_min lat_max lon_min lon_max
1 New York 40.73086 -73.98716 40.47740 40.91618 -74.25909 -73.70018
2 Berlin 52.51704 13.38886 52.35704 52.67704 13.22886 13.54886
3 Huangpu Qu 31.21823 121.48030 31.19020 31.24653 121.45220 121.50596
4 Vienna 48.20835 16.37250 48.04835 48.36835 16.21250 16.53250
5 St. Petersburg 27.77038 -82.66951 27.64364 27.91390 -82.76902 -82.54062
This way, you have both
the centroids (lon, lat) that Google Maps needs, and
the bounding boxes (lon_min, lat_min, lon_max, lat_max) that mapping services like OSM or Stamen need.
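For example, here is a hedged sketch of handing one of these centroids to get_googlemap(); it assumes result is the list printed above and that a Google API key is registered via ggmap::register_google():
library(ggmap)
# center a Google map on the OSM-geocoded centroid of one place
nyc <- result$address[result$address$query == "New York", ]
map <- get_googlemap(center = c(lon = nyc$lon, lat = nyc$lat), zoom = 10)
ggmap(map)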
I am trying to plot the USA map with Alaska and then project a set of points onto it. I used the fixup and fix1 functions posted at the following link to relocate Alaska and Hawaii:
Relocating Alaska and Hawaii on thematic map of the USA with ggplot2
Then I project my points, whose latitude and longitude values are given in the dput output below:
my_points= structure(c(44.334567, 44.209571, 44.049845, 44.752622, 44.791511,
44.391792, 44.540124, 46.075669, 46.007611, 45.836781, 46.595384,
45.703999, 45.47594, 44.715117, 44.385954, 42.804004, 43.349842,
43.252618, 42.891499, 56.212978, 55.391604, 55.395194, 60.476929,
61.709743, 62.76729, 62.346439, 59.93483, 64.472359, 64.88569,
-122.047007, -122.256733, -123.42621, -122.128965, -122.578974,
-122.497582, -122.435915, -121.998698, -122.347319, -122.466208,
-122.459556, -123.755405, -123.725121, -123.887335, -123.831778,
-123.610907, -122.728942, -123.026172, -124.070652, -131.638358,
-131.195583, -132.408627, -151.081668, -149.231938, -149.693379,
-150.019201, -158.190006, -146.926265, -147.249648), .Dim = c(29L,
2L))
My code is:
require(maptools)
require(rgdal)
fixup <- function(usa, alaskaFix, hawaiiFix) {
  alaska <- usa[usa$STATE_NAME == "Alaska", ]
  alaska <- fix1(alaska, alaskaFix)
  proj4string(alaska) <- proj4string(usa)
  hawaii <- usa[usa$STATE_NAME == "Hawaii", ]
  hawaii <- fix1(hawaii, hawaiiFix)
  proj4string(hawaii) <- proj4string(usa)
  usa <- usa[!usa$STATE_NAME %in% c("Alaska", "Hawaii"), ]
  usa <- rbind(usa, alaska, hawaii)
  return(usa)
}
fix1 <- function(object, params) {
  r <- params[1]; scale <- params[2]; shift <- params[3:4]
  object <- elide(object, rotate = r)
  size <- max(apply(bbox(object), 1, diff)) / scale
  object <- elide(object, scale = size)
  object <- elide(object, shift = shift)
  object
}
## download the USA shapefile to the computer
setwd(tempdir())
download.file("https://dl.dropbox.com/s/wl0z5rpygtowqbf/states_21basic.zip?dl=1",
"usmapdata.zip",
method = "curl")
# This is a mirror of
# http://www.arcgis.com/home/item.html?id=f7f805eb65eb4ab787a0a3e1116ca7e5
unzip("usmapdata.zip")
Then read in my shapefile. Use rgdal:
us <- readOGR(dsn = "states_21basic", layer = "states")
Now transform to equal-area, and run the fixup function:
usAEA <- spTransform(us, CRS("+init=epsg:2163"))
usfix <- fixup(usAEA, c(-35, 1.5, -2800000, -2600000), c(-35, 1, 6800000, -1600000))
plot(usfix)
The parameters are rotations, scaling, x and y shift for Alaska and Hawaii respectively, and were obtained by trial and error. Tweak them carefully. Even changing Hawaii's scale parameter to 0.99999 sent it off the planet because of the large numbers involved.
Turn this back to lat-long:
usfixLL <- spTransform(usfix, CRS("+init=epsg:4326"))
plot(usfixLL)
Now I work on my points to add them to the produced map:
require(sp)
my_NEW_S <- as.data.frame(my_points)
colnames(my_NEW_S) <- c("y", "x")
coordinates(my_NEW_S) <- ~ x + y  # column names of the lat/long cols
proj4string(my_NEW_S) <- CRS("+init=epsg:2163")
# add my points to the produced map
points(my_NEW_S,col="red",pch=16,cex=0.5)
All the points in the conterminous US are projected onto the map EXCEPT the ones related to Alaska and Hawaii. I would greatly appreciate help with making my Alaska points use the same projection as the relocated Alaska, or any other solution. Thank you.
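One observation worth checking (a hedged note, not a full answer): proj4string() only labels coordinates, it never converts them, so the raw lat/long points above are mislabeled as epsg:2163. A minimal sketch of the CRS handling that seems intended, declaring WGS84 first and then transforming:
library(sp)
library(rgdal)

my_NEW_S <- as.data.frame(my_points)
colnames(my_NEW_S) <- c("y", "x")
coordinates(my_NEW_S) <- ~ x + y
proj4string(my_NEW_S) <- CRS("+init=epsg:4326")  # the data are lat/long (WGS84)

# project to the map's equal-area CRS instead of merely relabelling
my_NEW_S_AEA <- spTransform(my_NEW_S, CRS("+init=epsg:2163"))
points(my_NEW_S_AEA, col = "red", pch = 16, cex = 0.5)

# the conterminous-US points now line up with usfix; the Alaska points would
# additionally need the same rotate/scale/shift as fix1() applied relative to
# Alaska's bounding box to match the relocated polygons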
I am trying to learn the Leaflet functionality in rCharts and would like to plot multiple markers with popups taken from a data.frame object in R:
df <- data.frame(
  location = c("White House", "Impound Lot", "Bush Garden", "Rayburn",
               "Robertson House", "Beers Elementary"),
  latitude = c(38.89710, 38.81289, 38.94178, 38.8867787, 38.9053894, 38.86466),
  longitude = c(-77.036545, -77.0171983, -77.073311, -77.0105317, -77.0616441, -76.95554)
)
df
location latitude longitude
1 White House 38.89710 -77.03655
2 Impound Lot 38.81289 -77.01720
3 Bush Garden 38.94178 -77.07331
4 Rayburn 38.88678 -77.01053
5 Robertson House 38.90539 -77.06164
6 Beers Elementary 38.86466 -76.95554
I tried modifying the code from the example on Ramnath's rCharts page. This is my modification:
map <- Leaflet$new()
map$setView(c(38.89710, -77.03655), 12)
map$tileLayer(provider = 'Stamen.TonerLite')
map$marker(c(df$latitude, df$longitude), bindPopup = df$location)
This code does not produce any markers. I am looking for a solution whereby I can plot the lat and lon for each observation and have a marker with a popup populated by the value in the location column.
That won't work, because df$latitude returns a vector with all the latitudes from your data frame:
df$latitude
## [1] 38.89710 38.81289 38.94178 38.88678 38.90539 38.86466
df$longitude
## [1] -77.03655 -77.01720 -77.07331 -77.01053 -77.06164 -76.95554
You'll need to add your markers in a loop:
for (i in 1:nrow(df)) {
  map$marker(c(df[i, "latitude"], df[i, "longitude"]), bindPopup = df[i, "location"])
}
Note: I did this quickly and freehand; I'm just starting out with R and am unable to test it at the moment.
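For reference, here are the pieces above combined into one untested sketch, in the same rCharts style as the question:
library(rCharts)

map <- Leaflet$new()
map$setView(c(38.89710, -77.03655), 12)
map$tileLayer(provider = 'Stamen.TonerLite')
for (i in 1:nrow(df)) {
  map$marker(c(df[i, "latitude"], df[i, "longitude"]),
             bindPopup = df[i, "location"])
}
map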
I'm trying to get the zip codes of a (long) list of longitude/latitude coordinates by using the revgeocode function in the ggmap library.
My question and data are the same as here: "Using revgeocode function in a FOR loop. Help required", but the accepted answer does not work for me.
My data (.csv):
ID, Longitude, Latitude
311175, 41.298437, -72.929179
292058, 41.936943, -87.669838
12979, 37.580956, -77.471439
I follow the same steps:
data <- read.csv(file.choose())
dset <- as.data.frame(data[,2:3])
location = dset
locaddr <- lapply(seq(nrow(location)), function(i) {
  revgeocode(location[i, ],
             output = c("address"),
             messaging = FALSE,
             sensor = FALSE,
             override_limit = FALSE)
})
... and get the error message: "Error: is.numeric(location) && length(location) == 2 is not TRUE".
Specifically, is.numeric(location) is FALSE, which seems strange because I can multiply location by 2 and get the expected answer.
Any help would be appreciated.
There are lots of things wrong here.
First, you have latitude and longitude reversed. All the locations in your dataset, as specified, are in Antarctica.
Second, revgeocode(...) expects a numeric vector of length 2 containing the longitude and latitude in that order. You are passing a data.frame object (this is the reason for the error), and as per (1) it's in the wrong order.
Third, revgeocode(...) uses the google maps api, which limits you to 2500 queries a day. So if you really do have a large dataset, good luck with that.
This code works with your sample:
data <- read.csv(text="ID, Longitude, Latitude
311175, 41.298437, -72.929179
292058, 41.936943, -87.669838
12979, 37.580956, -77.471439")
library(ggmap)
result <- do.call(rbind,
                  lapply(1:nrow(data),
                         function(i) revgeocode(as.numeric(data[i, 3:2]))))
data <- cbind(data,result)
data
# ID Longitude Latitude result
# 1 311175 41.29844 -72.92918 16 Church Street South, New Haven, CT 06519, USA
# 2 292058 41.93694 -87.66984 1632 West Nelson Street, Chicago, IL 60657, USA
# 3 12979 37.58096 -77.47144 2077-2199 Seddon Way, Richmond, VA 23230, USA
This extracts the zipcodes:
library(stringr)
data$zipcode <- substr(str_extract(data$result," [0-9]{5}, .+"),2,6)
data[,-4]
# ID Longitude Latitude zipcode
# 1 311175 41.29844 -72.92918 06519
# 2 292058 41.93694 -87.66984 60657
# 3 12979 37.58096 -77.47144 23230
I've written the package googleway to access the Google Maps API with a valid API key. So if your data has more than 2,500 items, you can pay for an API key and then use googleway::google_reverse_geocode().
For example
data <- read.csv(text="ID, Longitude, Latitude
311175, 41.298437, -72.929179
292058, 41.936943, -87.669838
12979, 37.580956, -77.471439")
library(googleway)
key <- "your_api_key"
res <- apply(data, 1, function(x){
google_reverse_geocode(location = c(x["Latitude"], x["Longitude"]),
key = key)
})
## Everything contained in 'res' is the data returned from the Google Maps API
## for example, the geometry section of the first lat/lon coordinates
res[[1]]$results$geometry
bounds.northeast.lat bounds.northeast.lng bounds.southwest.lat bounds.southwest.lng location.lat location.lng
1 -61.04904 180 -90 -180 -75.25097 -0.071389
location_type viewport.northeast.lat viewport.northeast.lng viewport.southwest.lat viewport.southwest.lng
1 APPROXIMATE -61.04904 180 -90 -180
To extract the zip code, you then need to drill into the address components of each result.
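As a hedged sketch (the exact nesting can vary with the googleway version, and get_zip is a hypothetical helper): the Google response stores the postal code in address_components with type "postal_code", so you can pull it out of each reverse-geocode result like this:
get_zip <- function(r) {
  comps <- r$results$address_components[[1]]  # components of the first match
  hit <- vapply(comps$types, function(t) "postal_code" %in% t, logical(1))
  if (any(hit)) comps$long_name[hit][1] else NA_character_
}
data$postal_code <- vapply(res, get_zip, character(1))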