Google geocoding and haversine distance calculation in R

I am using the geocode function from the ggmap package to geocode country names, and then passing the coordinates to distHaversine from the geosphere package to calculate the distance between two countries.
Sample of my data is as follows:
Country.Value Address.Country
1: United States United States
2: Cyprus United States
3: Indonesia United States
4: Tanzania Tanzania
5: Madagascar United States
6: Belize Canada
7: Argentina Argentina
8: Egypt Egypt
9: South Africa South Africa
10: Paraguay Paraguay
I have also used if-else statements to try to stay within the geocoding limits of the free Google Maps geocoder. My code is as follows:
for (i in 1:nrow(df)) {
  row <- df.cont.long[i, ]
  src_lon  <- 0.0
  src_lat  <- 0.0
  trgt_lon <- 0.0
  trgt_lat <- 0.0
  if (row$Country.Value == 'United States') {            # Reduce geocoding requirements
    trgt_lon <- -95.7129
    trgt_lat <- 37.0902
  } else if (row$Address.Country == 'United States') {   # Reduce geocoding requirements
    src_lon <- -95.7129
    src_lat <- 37.0902
  } else if (row$Country.Value == 'Canada') {            # Reduce geocoding requirements
    trgt_lon <- -106.3468
    trgt_lat <- 56.1304
  } else if (row$Primary.Address.Country == 'Canada') {  # Reduce geocoding requirements
    src_lon <- -106.3468
    src_lat <- 56.1304
  } else if (row$Country.Value == row$Address.Country) { # Reduce geocoding requirements
    # trgt <- geocode(row$Country.Value)
    # trgt_lon <- as.numeric(trgt$lon)
    # trgt_lat <- as.numeric(trgt$lat)
    # src_lon <- as.numeric(trgt$lon)
    # src_lat <- as.numeric(trgt$lat)
  } else {
    trgt <- geocode(row$Country.Value, output = c("latlon"))
    trgt_lon <- as.numeric(trgt$lon)
    trgt_lat <- as.numeric(trgt$lat)
    src <- geocode(row$Address.Country)
    src_lon <- as.numeric(src$lon)
    src_lat <- as.numeric(src$lat)
  }
  print(i)
  print(c(row$Address.Country, src_lon, src_lat))
  print(c(row$Country.Value, trgt_lon, trgt_lat))
  print(distHaversine(p1 = c(as.numeric(src$lon), as.numeric(src$lat)),
                      p2 = c(as.numeric(trgt$lon), as.numeric(trgt$lat))))
}
In the output:
Sometimes geocoding is done, sometimes not, and the coordinates default to 0.0.
Sometimes the distance is calculated, sometimes not.
I have no idea where the code is going wrong.
Moreover, uncommenting the lines where I check whether Country.Value and Address.Country are equal makes things even worse.

The functions you're using are vectorized, so all you really need is
library(ggmap)
library(geosphere)
distHaversine(geocode(as.character(df$Country.Value)),
              geocode(as.character(df$Address.Country)))
# [1] 0 10432624 14978567 0 15868544 4588708 0 0 0 0
Note the as.character calls are there because ggmap::geocode doesn't like factors. The results make sense:
df$distance <- distHaversine(geocode(as.character(df$Country.Value), source = 'dsk'),
                             geocode(as.character(df$Address.Country), source = 'dsk'))
df
# Country.Value Address.Country distance
# 1 United States United States 0
# 2 Cyprus United States 10340427
# 3 Indonesia United States 14574480
# 4 Tanzania Tanzania 0
# 5 Madagascar United States 16085178
# 6 Belize Canada 5172279
# 7 Argentina Argentina 0
# 8 Egypt Egypt 0
# 9 South Africa South Africa 0
# 10 Paraguay Paraguay 0
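As a sanity check on these numbers, the haversine formula that distHaversine implements is easy to write in base R. The sketch below is an illustration, not geosphere's actual source; the 6378137 m radius matches geosphere's default Earth radius:

```r
# Great-circle distance via the haversine formula (illustrative sketch;
# the 6378137 m radius matches geosphere's default).
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  rad  <- pi / 180                      # degrees -> radians
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))        # pmin guards against rounding above 1
}

haversine(0, 0, 1, 0)  # one degree of longitude at the equator, ~111319.5 m
```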
Edit
If you don't want to use ggmap::geocode, tmap::geocode_OSM is another geocoding function, which uses OpenStreetMap data. However, because it is not vectorized, you need to iterate over it, either column-wise:
distHaversine(t(sapply(df$Country.Value, function(x){tmap::geocode_OSM(x)$coords})),
              t(sapply(df$Address.Country, function(x){tmap::geocode_OSM(x)$coords})))
# [1] 0 10448111 14794618 0 16110917 5156823 0 0 0 0
or rowwise:
apply(df, 1, function(x){distHaversine(tmap::geocode_OSM(x['Country.Value'])$coords,
                                       tmap::geocode_OSM(x['Address.Country'])$coords)})
# [1] 0 10448111 14794618 0 16110917 5156823 0 0 0 0
and subset to the coords data. Also note that Google, DSK, and OSM each choose a different center point for each country, so the resulting distances differ somewhat.

Related

R- delete the tail word

Can someone show me how to delete the trailing word? Thanks.
from
1 North Africa
2 Algeria
3 Canary Islands (Spain)[153]
4 Ceuta (Spain)[154]
to
1 North Africa
2 Algeria
3 Canary Islands
4 Ceuta
Apologies for my poor English.
It seems that you want to trim a trailing parenthesized name, along with anything that follows it to the end of the string. We can use sub for this purpose:
df <- data.frame(id = 1:4,
                 places = c("North Africa", "Algeria",
                            "Canary Islands (Spain)[153]", "Ceuta (Spain)[154]"),
                 stringsAsFactors = FALSE)
df$places <- sub("\\s*\\(.*\\).*$", "", df$places)
df
id places
1 1 North Africa
2 2 Algeria
3 3 Canary Islands
4 4 Ceuta
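Note that sub() returns its input unchanged when the pattern does not match, so rows without a parenthesized suffix pass through untouched:

```r
sub("\\s*\\(.*\\).*$", "", c("Algeria", "Ceuta (Spain)[154]"))
# [1] "Algeria" "Ceuta"
```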

How to recode and encode a country pair variable in R

I am trying to recode a variable for country pairs, e.g. an exporter EFG and an importer ISR form the country pair EFGISR. I need these pairs for a panel data analysis, so the country pairs have to be converted to numeric variables. I am familiar with the as.numeric command; however, recoding these variables back to the original format seems to be a tough job. Do you know a better way to code it, or a way to use the factor variable as a reference for a recode call? I will have to use the plm package and the command make.pbalanced().
Cheers and I would really appreciate your help!
edit:
idvar <- c("BRAWLD", "BRAALB", "BRADZA", "BRAARG", "BRAAUS", "BRAAUT", "BRABHR", "BRAARM")
as.numeric(idvar)
[1] 108 2 30 5 7 8 12 6 9 15 11 17 23 19
as.factor(idvar)
[1] 108 2 30 5 7 8 12 6 9 15 11 17 23 19
This is what I would like to get back:
idvar
BRAWLD, BRAALB, BRADZA, BRAARG, BRAAUS, BRAAUT, BRABHR, BRAARM
Here is the head of my dataset:
year exp exp_iso imp imp_iso nw tv nw_c nw_dc tv_c tv_dc tv_total nw_total id_var
1996-BRAARE 1996 Brazil BRA United Arab Emirates ARE 563812 1245639 563812 0 1245639 0 1245639 563812 BRAARE
1996-BRAARG 1996 Brazil BRA Argentina ARG 34006800 77508984 34006800 0 77508984 0 77508984 34006800 BRAARG
1996-BRAARM 1996 Brazil BRA Armenia ARM 38398 70656 38398 0 70656 0 70656 38398 BRAARM
1996-BRAAUS 1996 Brazil BRA Australia AUS 3213000 7864554 3213000 0 7864554 0 7864554 3213000 BRAAUS
1996-BRAAUT 1996 Brazil BRA Austria AUT 11189578 25442560 11189578 0 25442560 0 25442560 11189578 BRAAUT
1996-BRABEL 1996 Brazil BRA Belgium BEL 41944172 93179224 41944172 0 93179224 0 93179224 41944172 BRABEL
I found an appealing solution to the problem. The countrycode package provides a function with which I could convert the character country codes to numeric codes, using countrycode(..., destination = "iso3n").
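For the round trip between character pair codes and numeric ids, a plain factor can also serve as the lookup table. A minimal base-R sketch (the pair codes are quoted here as strings, which the snippet in the question omits):

```r
# Map country-pair codes to numeric ids and back via factor levels.
idvar <- c("BRAWLD", "BRAALB", "BRADZA", "BRAARG",
           "BRAAUS", "BRAAUT", "BRABHR", "BRAARM")
f      <- factor(idvar)
num_id <- as.numeric(f)   # numeric codes for the panel estimation
levels(f)[num_id]         # recovers the original character pair codes
```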

For loop skips rows in R dataframe

I have a for loop printing values out of this small test dataframe.
USA Finland China Sweden
1 1 3 5.505962 8.310596
2 2 4 11.033347 5.425747
3 3 5 14.932882 3.272544
4 4 6 10.155517 5.980190
5 5 7 11.020148 3.692313
Total 0 0 0.000000 0.000000
This line prints out a single row of the dataframe:
print(countries[2,])
and results in this:
USA Finland China Sweden
2 2 4 11.03335 5.425747
So based on that, I imagined I could do the same in a for loop and print out all the rows. Code for the loop:
for (i in countries[1, ]) {
  print(countries[i, ])
}
However, this results in only every second row being printed, which doesn't make sense. The result I get is this:
USA Finland China Sweden
1 1 3 5.505962 8.310596
USA Finland China Sweden
3 3 5 14.93288 3.272544
USA Finland China Sweden
5 5 7 11.02015 3.692313
USA Finland China Sweden
NA NA NA NA NA
What could possibly lead to this happening? I'm using RStudio, so could it be the console logging not keeping up with the values?
lmo's comment suggests a solution. I think you also want to know why this happened, so I'll try to answer that.
You are using this code:
1: for (i in countries[1,])
2: {
3: print(countries[i,])
4: }
In line 1 you are selecting the vector of values that i will iterate over. This vector happens to be the first row of your data: 1 3 5.505962 8.310596. Used as row indices, it is truncated to c(1, 3, 5, 8).
So in line 3 you are printing rows 1, 3, 5, and 8 (row 8 does not exist, hence the NA row). It is only a coincidence that this looks like every second row being skipped, but I hope this makes it clearer.
Of course, you should use countries[1:5, ] or simply print(countries) instead of a for loop.
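For completeness, iterating over rows by index looks like this (the data frame is rebuilt here to mimic the question's example):

```r
countries <- data.frame(USA     = 1:5,
                        Finland = 3:7,
                        China   = runif(5, 5, 15),
                        Sweden  = runif(5, 3, 9))
for (i in seq_len(nrow(countries))) {  # 1, 2, ..., nrow(countries)
  print(countries[i, ])
}
```

seq_len(nrow(countries)) always yields valid row indices, so no row is skipped and no NA rows appear.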

Use DocumentTermMatrix in R with 'dictionary' parameter

I want to use R for text classification. I use DocumentTermMatrix to build the document-term matrix:
library(tm)
crude <- "japan korea usa uk albania azerbaijan"
corps <- Corpus(VectorSource(crude))
dtm <- DocumentTermMatrix(corps)
inspect(dtm)
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test <- DocumentTermMatrix(corps, control=list(dictionary = words))
inspect(test)
The first inspect(dtm) works as expected, with this result:
Terms
Docs albania azerbaijan japan korea usa
1 1 1 1 1 1
But the second inspect(test) shows this result:
Terms
Docs argentina australia japan korea turkey uganda
1 0 1 0 1 0 0
While the expected result is:
Terms
Docs argentina australia japan korea turkey uganda
1 0 0 1 1 0 0
Is this a bug, or am I using it the wrong way?
Corpus() seems to have a bug when indexing word frequencies.
Use VCorpus() instead; this will give you the expected result.
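Applied to the question's example, the fix is only the corpus constructor (this sketch assumes the tm package is installed):

```r
library(tm)

crude <- "japan korea usa uk albania azerbaijan"
corps <- VCorpus(VectorSource(crude))   # VCorpus instead of Corpus
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test  <- DocumentTermMatrix(corps, control = list(dictionary = words))
as.matrix(test)  # japan and korea are now counted, the rest are 0
```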

Create count per item by year/decade

I have data in a data.table that is as follows:
> x <- df[sample(nrow(df), 10), ]
> x
Importer Exporter Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and as Exporter. So, in the above example, the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried the aggregate and data.table methods suggested by the post here, but both of them just give me the total count of importers/exporters per year (or decade, which I am more interested in).
> x$Decade<-year(x$Date)-year(x$Date)%%10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Since aggregate uses the formula interface, I tried adding another criterion, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary by decade and by importer/exporter? It does not matter if the summaries for importer and exporter are in different tables.
We can do this using data.table methods: create the 'Decade' column by assignment (:=), then melt the data from 'wide' to 'long' format by specifying the measure columns, and reshape it back to 'wide' with dcast, using length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
      Decade + Country ~ variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
I think with() will work with aggregate() in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
                      year  = as.numeric(format(my.data$my.Date, format = "%Y")),
                      month = as.numeric(format(my.data$my.Date, format = "%m")),
                      day   = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1
