R geocode query error when address has hash

An address containing a "#" (for example an apartment number) frequently gives an incorrect location result, both with ggmap::geocode and with Google Maps itself, so this is not strictly an R question. In this example, adding "#3" after the street address changes the location result from Illinois to California:
> test <- geocode('1200 Davis St, Evanston, IL 60202', source='google', output='more')
> test[, c('lon', 'lat', 'administrative_area_level_1')]
        lon      lat administrative_area_level_1
1 -87.68978 42.04627                    Illinois
> testhash <- geocode('1200 Davis St #3, Evanston, IL 60202', source='google', output='more')
> testhash[, c('lon', 'lat', 'administrative_area_level_1')]
        lon      lat administrative_area_level_1
1 -122.1692 37.72169                  California
If you experiment with Google Maps directly, adding a hash to an address sometimes seems to confuse the lookup, generating a variety of geographically dispersed results. This doesn't always happen, but in my experience it happens frequently. It's easily fixed (there's no need for an apartment number when geocoding), but I'm wondering why it happens and whether there are other cautions about entering addresses.

Google publishes recommendations regarding address formatting for the Geocoding API. In particular, they advise against specifying additional elements such as apartment or floor numbers in requests.
You can check the complete list of recommendations in the Google Maps FAQ:
https://developers.google.com/maps/faq#geocoder_queryformat
The relevant part is:
Do not specify additional address elements such as business names, unit numbers, floor numbers, or suite numbers that are not included in the address as defined by the postal service of the country concerned. Doing so may result in responses with ZERO_RESULTS.
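Since unit numbers add nothing when geocoding, one option is to strip them from addresses before the lookup. A minimal sketch, assuming apartment markers of the form "#<digits>" (adjust the regex for your data):
addr <- "1200 Davis St #3, Evanston, IL 60202"
clean <- gsub("\\s*#\\d+", "", addr)  # drop " #3"
clean
# [1] "1200 Davis St, Evanston, IL 60202"
# geocode(clean, source = "google", output = "more")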
I hope this helps!

How to count the frequency of unique words in a column?

I need to count the frequency of unique words found in a column that contains descriptions in each row.
So far, I have eliminated a list of stopwords from the original column and I have extracted the unique words from the column and put them into a list called unique_description.
> description[1:5]
[1] "Come stay Vinh & Stuart (Awarded one Australia's top hosts Airbnb CEO Brian Chesky & key shareholder Ashton Kutcher. We're Sydney's #1 reviewed hosts ). Find out 've positively reviewed 500+ times. Message talk first BEFORE make reservation request - And please read listing end (hint hint). Everything need know . We're pretty relaxed hosts, fully appreciate staying someone , home home, -one. This business, hotel. We're casual Airbnb hosts, hoteliers. If 're looking alternative expensive hotel, 're . Here 'll treated same way treat family & friends stay. So... fluffy bathrobes... Please hello message *BEFORE* make reservation request... It'll help speed things up, smooth things out... Please read listing way end. It make getting confirmed reserv"
[2] "Beautifully renovated, spacious quiet, 3 Bedroom, 3 Bathroom home 10 minute walk beaches Fairlight Forty Baskets, 30 minute walk Manly via coastal promenade, Express bus runs 20 mins door. Our home thirty minute walk along seashore promenade Manly, one Sydney's beautiful beaches, village restaurants, cafes, shopping. If prefer more variety, Manly ferry take Sydney CBD 15 minutes. The residence sited sought- family-friendly street short stroll nearby North Harbour reserve Forty Baskets cafe beach. It's short walk further express CBD buses, ferries, Manly entertainment. Or bus (#131 #132) around corner drops Manly 8 minutes. Our home features stainless steel galley kitchen, including Ilve oven gas cooktop. We two separate living areas ground floor. The front lounge enjoys P&O"
[3] "Welcome sanctuary - bright, comfortable one bedroom apartment North Sydney. Free Wifi, heated pool/jacuzzi everything need make stay Sydney very comfortable. Enjoy fabulous Home away home, fantastic stay Sydney! The apartment within walking distance restaurants shops, Luna Park North Sydney business district. Access Sydney CBD easy bus, train, taxi ferry. It short bus ride famous Balmoral Beach Taronga Zoo. My apartment situated North Sydney 3 kms Sydney CBD. Here details apartment: You'll enjoy being centrally located couple blocks away train station go anywhere quickly Sydney. The apartment features several windows tons natural light. It comfortable fully stocked. Here's I here: LIVING ROOM: 50\" LCD TV DVD / blu ray player CD/Radio/Blue tooth syncing w"
[4] "Fully self-contained sunny studio apartment. 10mn walk Bondi beach. Bus city door. Private 13m swimming pool. Sunny, studio apartment . Private terrace. bus door Bondi Junction City Ground floor 1 bedroom double bed plus kitchenette & study desk. shower & toilet, share laundry, kitchen facilities Swimming pool 13m. Separate security private entrance Private entrance. Ground floor. Happy indicate best spots walking, dining, entertaining best sightseeing location Sydney. Upmarket area. Very nice quiet neighbourhood . Very safe place. Bus door city."
[5] "Sunny warehouse/loft apartment heart one Sydney's best neighbourhoods. Located corner two iconic Darlinghurst streets famous laneway bars eateries, footsteps equally amazing Surry Hills Potts Point. Walk through beautiful parks city less 10 mins, opera house 20 access Bondi Beach easily 25 via bus stop directly front building. My apartment beautiful, simple, open plan / one bedroom loft soaring high ceilings hardwood floors hint 's previous life printing factory 1940s. It huge windows flood space glorious sunshine throughout day provide refreshing breeze during summer. A few key features: * Wireless harman/kardon aura stereo system stream music wirelessly bluetooth device * Internal laundry washer dryer * The kitchen equipped gas cooking, microwave, dishwasher basics preparing m"
> unique_description[1:10]
[1] "Come" "stay" "Vinh" "&" "Stuart" "(Awarded"
[7] "one" "Australia's" "top" "hosts"
I'm not sure how to count the frequency of the words in unique_description that are found in the column 'description'. I tried using freq_terms from library(qdap), but qdap will not load for me, so I am trying to find another way.
You could use the stringr package.
library(stringr)
x <- "Come stay Vinh & Stuart (Awarded one Australia's top hosts Airbnb CEO Brian Chesky & key shareholder Ashton Kutcher. We're Sydney's #1 reviewed hosts ). Find out 've positively reviewed 500+ times. Message talk first BEFORE make reservation request - And please read listing end (hint hint). Everything need know . We're pretty relaxed hosts, fully appreciate staying someone , home home, -one. This business, hotel. We're casual Airbnb hosts, hoteliers. If 're looking alternative expensive hotel, 're . Here 'll treated same way treat family & friends stay. So... fluffy bathrobes... Please hello message *BEFORE* make reservation request... It'll help speed things up, smooth things out... Please read listing way end. It make getting confirmed reserv"
y <- "come stay Stuart"
unique_desc <- c("come", "stay", "Stuart")
desc <- c(x,y)
result <- lapply(desc, FUN = str_count, pattern = unique_desc)
# result[[i]] holds the counts of each unique_desc word in desc[i]
lapply applies str_count to each element of desc. The ith entry of result is a vector of counts for the ith entry of desc, with one count per word in unique_desc. Note that str_count counts substring matches, so "stay" also matches "staying"; wrap the patterns in word boundaries (e.g. paste0("\\b", unique_desc, "\\b")) if you want whole words only.
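To get total frequencies across the whole column, you can sum the per-element count vectors. A minimal sketch building on result above:
# Sum the count vectors element-wise and label them with the words.
totals <- setNames(Reduce(`+`, result), unique_desc)
totals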

Calculate toll costs in the USA using HERE Maps

Using this example, https://developer.here.com/documentation/toll-cost/topics/example-tollcost.html,
and the description from https://developer.here.com/documentation/toll-cost/topics/resource-tollcost-input-interface.html, I am trying to calculate tolls on USA roads.
First I found a paid road, to be sure to have one on the route (found here: https://www.sixt.com/toll-roads/usa/): a route in Florida from Punta Gorda, FL, United States to Islamorada, FL, United States.
From the route calculation I received two links:
0: {linkId: "-91500228", mappedPosition: {latitude: 26.9336687, longitude: -82.0531188},…}
1: {linkId: "+897488196", mappedPosition: {latitude: 24.9598652, longitude: -80.5699824},…}
Then I make the toll-cost request:
https://tce.api.here.com/2/tollcost.json
?app_id=my_app_id
&app_code=my_code
&tollVehicleType=3
&vehicleNumberAxles=2
&emissionType=6
&height=3.5m
&vehicleWeight=10.0t
&limitedWeight=10.0t
&passengersCount=1
&tiresCount=8
&route=-91500228;897488196
&detail=1
In response I always receive:
{"errors":[],"warnings":[{"category":1,"context":"Old response format is deprecated, please use request parameter &rollup"}],"countries":[],"onError":false}
I have tried different locations in the USA; no matter what, I always get an empty countries array.
Please advise what I'm missing, thank you.
Here is a working example:
https://tce.cit.api.here.com/2/calculateroute.json?jsonAttributes=41&waypoint0=26.13442,-81.68696&detail=1&waypoint1=26.08104,-80.36682&routelegattributes=li&routeattributes=gr&maneuverattributes=none&linkattributes=none,rt,fl&legattributes=none,li,sm&currency=EUR&departure=&tollVehicleType=3&emissionType=5&height=3.8m&vehicleWeight=11000&limitedWeight=11t&passengersCount=1&tiresCount=4&commercial=1&heightAbove1stAxle=1m&width=2.55&length=10&mode=fastest;truck;traffic:enabled&rollup=none,country;tollsys&app_id=yyy&app_code=xxx
In general, the rollup parameter is missing from your request; it needs to be added in any case. Besides that, the route you are trying to calculate runs quite far over islands, with a truck of a specific size and weight. It is quite difficult to diagnose this from the pure HTTP REST request alone. Maybe this demo page helps a bit for debugging such cases:
https://tcs.ext.here.com/examples/v3/route_with_tce
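For reference, a minimal R sketch of the original request with rollup added (a sketch assuming the httr package; app_id and app_code are placeholders):
library(httr)
resp <- GET(
  "https://tce.api.here.com/2/tollcost.json",
  query = list(
    app_id = "my_app_id",        # placeholder credential
    app_code = "my_code",        # placeholder credential
    tollVehicleType = 3,
    vehicleNumberAxles = 2,
    emissionType = 6,
    height = "3.5m",
    vehicleWeight = "10.0t",
    limitedWeight = "10.0t",
    passengersCount = 1,
    tiresCount = 8,
    route = "-91500228;897488196",
    detail = 1,
    rollup = "none,country;tollsys"  # the previously missing parameter
  )
)
str(content(resp, as = "parsed"))
Note that httr percent-encodes query values; if the API objects to encoded separators such as ";", build the URL string manually instead.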

Bayesian networks for text analysis in R

I have a one-page story (i.e. text data), and I need to apply a Bayesian network to that story and analyse it. Could someone tell me whether this is possible in R? If yes, how should I proceed?
The objective of the analysis is to extract action descriptions from narrative text.
The data considered for analysis:
Krishna’s Dharam-shasthra to Arjuna:
The Gita is the conversation between Krishna and Arjuna leading up to the battle.
Krishna emphasised on two terms: Karma and Dharma. He told Arjun that this was a righteous war; a war of Dharma. Dharma is the way of righteousness or a set of rules and laws laid down. The Kauravas were on the side of Adharma and had broken rules and laws and hence Arjun would have to do his Karma to uphold Dharma.
Arjuna doesn't want to fight. He doesn't understand why he has to shed his family's blood for a kingdom that he doesn't even necessarily want. In his eyes, killing his evil and killing his family is the greatest sin of all. He casts down his weapons and tells Krishna he will not fight. Krishna, then, begins the systematic process of explaining why it is Arjuna's dharmic duty to fight and how he must fight in order to restore his karma.
Krishna first explains the samsaric cycle of birth and death. He says there is no true death of the soul simply a sloughing of the body at the end of each round of birth and death. The purpose of this cycle is to allow a person to work off their karma, accumulated through lifetimes of action. If a person completes action selflessly, in service to God, then they can work off their karma, eventually leading to a dissolution of the soul, the achievement of enlightenment and vijnana, and an end to the samsaric cycle. If they act selfishly, then they keep accumulating debt, putting them further and further into karmic debt.
What I want is a POS tagger to separate verbs, nouns, etc., and then to create a meaningful network from them.
The steps that should be followed in pre-processing are:
syntactic processing (POS tagging)
SRL algorithm (semantic role labelling of the characters of the story)
coreference resolution
Using all of the above, I need to create a knowledge database and build a Bayesian network.
This is what I have tried so far for the POS tagging:
txt <- c("As the years went by, they remained isolated in their city. Their numbers increased by freeing women from slavery.
Doom would come to the world in the form of Ares the god of war and the Son of Zeus. Ares was unhappy with the gods as he wanted to prove just how foul his father’s creation was. Hence, he decided to corrupt the mortal men created by Zeus. Fearing his wrath upon the world Zeus decided to create the God killer in order to stop Ares. He then commanded Hippolyta to mould a baby from the sand and clay of the island. Then the five goddesses went back into the Underworld, drawing out the last soul that remained in the Well and giving it incredible powers. The soul was merged with the clay and became flesh. Hippolyta had her daughter and named her Diana, Princess of the Amazons, the first child born on Paradise Island.
Each of the six members of the Greek Pantheon granted Diana a gift: Demeter, great strength; Athena, wisdom and courage; Artemis, a hunter's heart and a communion with animals; Aphrodite, beauty and a loving heart; Hestia, sisterhood with fire; Hermes, speed and the power of flight. Diana was also gifted with a sword, the Lasso of truth and the bracelets of penance as weapons to defeat Ares.
The time arrived when Diana, protector of the Amazons and mankind was sent to the Man's World to defeat Ares and rid the mortal men off his corruption. Diana believed that only love could truly rid the world of his influence. Diana was successfully able to complete the task she was sent out by defeating Ares and saving the world.
")
writeLines(txt, tf <- tempfile())
library(stringi)
library(cleanNLP)
# Tokenizer-only backend (tokenisation only, no POS tags)
cnlp_init_tokenizers()
anno <- cnlp_annotate(tf)
names(anno)
get_token(anno)
# spaCy backend (adds POS tags; requires a Python spaCy installation)
cnlp_init_spacy()
anno <- cnlp_annotate(tf)
get_token(anno)
# CoreNLP backend (requires the CoreNLP models)
cnlp_init_corenlp()
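For the POS-tagging step, one possible sketch (an assumption on my part, not the asker's cleanNLP setup: it uses the udpipe package, which ships with downloadable pre-trained models, plus igraph for the network):
library(udpipe)
library(igraph)
# Download and load a pre-trained English model (one-time download).
model <- udpipe_download_model(language = "english")
ud <- udpipe_load_model(model$file_model)
# Annotate: tokenisation, POS tags (upos), lemmas, dependencies.
anno <- as.data.frame(udpipe_annotate(ud, x = txt))
# Keep verbs and nouns and count lemma co-occurrences per sentence.
vn <- subset(anno, upos %in% c("VERB", "NOUN"))
co <- cooccurrence(vn, term = "lemma", group = c("doc_id", "sentence_id"))
# Turn the strongest co-occurrences into a network for further analysis.
g <- graph_from_data_frame(head(co, 50), directed = FALSE)
The resulting token table (anno) with its upos column is also a reasonable starting point for the knowledge database feeding a Bayesian network (e.g. with the bnlearn package).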

How to search a word in a dictionary with a document in R?

I have created a dictionary of words. Now I need to check whether the word in the dictionary is present in the document or not. The sample of the document is given below:
Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body.
There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure. Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter. The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location.
From this document, I have split each paragraph into each sentence as follows:
[1] "Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body."
[2] "There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure."
[3] "Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter."
[4] "The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location."
The dictionary includes the following words:
Laparoscopic surgery
minimally invasive surgery
bandaid surgery
keyhole surgery
surgical technique
small incisions
fiber optic cable system
Now I want to search each sentence for all of the words in the dictionary using R. The code that I have worked out so far is given below.
c <- "Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body.
There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure. Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter. The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location."
library(tm)
library(openNLP)
convert_text_to_sentences <- function(text, lang = "en") {
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)
  text <- as.String(text)
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  sentences <- text[sentence.boundaries]
  return(sentences)
}
q <- convert_text_to_sentences(c)
Assuming q is a character vector (or list) of the sentences and you're interested in exact matches of the keywords only, you can use regular expressions:
matches = lapply(q, function(x) dict[sapply(dict, grepl, x, ignore.case=T)])
You get a list of the same length as q. Each list element contains a vector of the dictionary words found in the corresponding sentence.
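For instance, with the dictionary from the question (q as built above):
dict <- c("Laparoscopic surgery", "minimally invasive surgery",
          "bandaid surgery", "keyhole surgery", "surgical technique",
          "small incisions", "fiber optic cable system")
matches <- lapply(q, function(x) dict[sapply(dict, grepl, x, ignore.case = TRUE)])
matches[[1]]  # the dictionary terms found in the first sentence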

Replicate Postgres pg_trgm text similarity scores in R?

Does anyone know how to replicate the Postgres (pg_trgm) trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these scores on a matrix of text strings in a .csv file than run a bunch of PostgreSQL queries.
Running similarity(string1, string2) in Postgres gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.
stringdist(string1, string2, method="qgram",q = 3 )
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre like science fiction. For example, suppose I have the two book descriptions and the genre description below:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2 = "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a similarity score for the description of each book against the description of the science fiction genre, like pg_trgm, using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
  list(text1 = "hello there",
       text2 = "why hello there",
       text3 = "totally different"),
  method = "cosine"), 3)
#       text1 text2 text3
# text1 0.000 0.078 0.731
# text2 0.078 0.000 0.739
# text3 0.731 0.739 0.000
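If you want to stay closer to pg_trgm specifically: its similarity() is the Jaccard similarity of the two strings' trigram sets, so one rough equivalent is 1 minus stringdist's Jaccard q-gram distance. A sketch (pg_trgm also lowercases and pads words with spaces before extracting trigrams, so the values will be close but not identical):
library(stringdist)
# "jaccard" with q = 3 returns a distance on trigram sets;
# subtracting from 1 gives a pg_trgm-like similarity score.
pg_trgm_approx <- function(a, b) {
  1 - stringdist(tolower(a), tolower(b), method = "jaccard", q = 3)
}
# e.g. pg_trgm_approx(book1, genre1) with the texts above stored as variables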
