Sorry if this is repetitive, but I've looked everywhere and can't seem to find anything that addresses my specific problem in R. I have a column with city names:
cities <-data.frame(c("Sydney", "Dusseldorf", "LidCombe", "Portland"))
colnames(cities)[1]<-"CityName"
Ideally I'd like to attach a column with either the lat/long for each city or the time zone. I have tried using the "ggmap" package in R, but my request exceeds the maximum number of requests they allow per day. I found the "geonames" package that converts lat/long to timezones, so if I get the lat/long for the city I should be able to take it from there.
Edit to address potential duplicate question: I would like to do this without using the ggmap package, as I have too many rows and they have a maximum # of requests per day.
You can get at least many major cities from the world.cities data in the maps package.
## Changing your data to a vector
cities <- c("Sydney", "Dusseldorf", "LidCombe", "Portland")
## Load up data
library(maps)
data(world.cities)
world.cities[match(cities, world.cities$name), ]
name country.etc pop lat long capital
36817 Sydney Australia 4444513 -33.87 151.21 0
10026 Dusseldorf Germany 573521 51.24 6.79 0
NA <NA> <NA> NA NA NA NA
29625 Portland Australia 8757 -38.34 141.59 0
Note: LidCombe was not included.
Warning: For many names, there is more than one world city. For example,
world.cities[grep("Portland", world.cities$name), ]
name country.etc pop lat long capital
29625 Portland Australia 8757 -38.34 141.59 0
29626 Portland USA 542751 45.54 -122.66 0
29627 Portland USA 62882 43.66 -70.28 0
Of course the two in the USA are Portland, Maine and Portland, Oregon.
match is just giving the first one on the list. You may need to use more information than just the name to get a good result.
Related
I'm working on a problem where I'm trying to map each state to a region for some data analysis. It seems the first thing I need to do is create a dataframe containing the names of all 50 states. Is there a way to do this without explicitly naming each state and inputting it into a row in the dataframe?
Sample data:
region_key <- as.data.frame("")
colnames(region_key) <- c("state")
region_key$region <- ""
region_key$state <- "AL"
I create an empty data frame, create a "state" and "region" column, then populate the state two letter abbreviations in the above fashion. Is there a way to both populate the data frame with the state abbreviations and classify by region (e.g. Alabama would be "South")?
Expected output:
head(region_key)
state region
1 AL South
Thanks in advance for your help!
Figured out my problem based on the comment from #alistair, thank you.
Sample data:
region_key <- data.frame(state.abb, state.region)
head(region_key)
state.abb state.region
1 AL South
2 AK West
3 AZ West
4 AR South
5 CA West
6 CO West
I'm currently working with an R script set up to use RDSTK, a wrapper for the Data Science Toolkit API based on this, to geocode a list of addresses from a CSV.
The script appears to work, but the list of addresses has a preexisting unique identifier which isn't preserved in the process - the input file has two columns: id, and address. The id column, for the purposes of the geocoding process, is meaningless, but I'd like the output to retain it - that is, I'd like the output, which has three columns (address, long, and lat) to have four - id being the first.
The issue is that
The output is not in the same order as the input addresses, or doesn't appear to be, so I cannot simply tack on the column of addresses at the end, and
The output does not include nulls, so the two would not be the same number of rows in any case, even if it was the same order, and
I am not sure how to effectively tie the id column in such that it becomes a part of the geocoding process, which obviously would be the ideal solution.
Here is the script:
require("RDSTK")
library(httr)
library(rjson)
dff = read.csv("C:/Users/name/Documents/batchtestv2.csv")
data <- paste0("[",paste(paste0("\"",dff$address,"\""),collapse=","),"]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,lapply(json, function(x) c(long=x$longitude,lat=x$latitude)))
geocode
write.csv(geocode, file = "C:/Users/name/Documents/geocodetest.csv")
And here is a sample of the output:
2633 Camino Ramon Suite 500 San Ramon California 94583 United States -121.96208 37.77027
555 Lordship Boulevard Stratford Connecticut 6615 United States -73.14098 41.16542
500 West 13th Street Fort Worth Texas 76102 United States -97.33288 32.74782
50 North Laura Street Suite 2500 Jacksonville Florida 32202 United States -81.65923 30.32733
7781 South Little Egypt Road Stanley North Carolina 28164 United States -81.00597 35.44482
Maybe the solution is extraordinarily simple and I'm just being dense - it's entirely possible (I don't have extensive experience with any particular language, so I sometimes miss obvious things) but I haven't been able to solve it.
Thanks in advance!
I am wanting to extract strings from elements in a data frame. Having gone through numerous previous questions, I am still unable to understand what to do! This is what I have tried to do so far:
unlist(strsplit(pcode2$Postcode,"'"))
I get the following error:
Error in strsplit(pcode2$Postcode, "'") : non-character argument
which I understand because I am trying to reference the data rather than putting the text in the code itself. I have 16,000 cases in a dataframe so also not sure how to vectorise the operation.
Any help would be greatly appreciated.
Data:
Postcode Locality State Latitude Longitude
1 ('200', Australian National University ACT -35.280, 149.120),
2 ('221', Barton ACT -35.200, 149.100),
3 ('3030', Werribee VIC -12.800, 130.960),
4 ('3030', Point Cook VIC -12.800, 130.960),
I want to get rid of the commas and braces etc so that I am left with the numeric part of Column 1 which is Postcode, numeric part of Latitude andLongitude. This is how the I am hoping the final result will look like:
Postcode Locality State Latitude Longitude
1 200 Australian National University ACT -35.280 149.120
2 221 Barton ACT -35.200 149.100
3 3030 Werribee VIC -12.800 130.960
4 3030 Point Cook VIC -12.800 130.960
Lastly, I would also like to understand how to nicely format the data in the questions.
I'm a newbie to R programming..I have a csv file contains items by country, life expectancy and region. And I've to do the following:
List out no. of countries regionwise & draw bar chart
Draw boxplot for each region
Cluster countries based on life expectancy using k-means algorithm
Name the countries that have the min & max life expectancy.
input.csv
Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe
What I did?
1.
mydata <- read.table("input.csv", header=TRUE, sep=",")
barplot(data$ncol(Region))
and I get the error Error in barplot(mydata$ncol(Region)) : attempt to apply non-function
boxplot(LifeExpectancy~Region,mydata=data) ##This is correct
3 Have no idea how to do this!
4.min(mydata$LifeExpectancy);max(mydata$LifeExpectancy) ##This is correct
As I pointed out in my comments, this question is really multiple questions, and does not reflect the title. In future, please try to keep questions manageable and discrete. I'm not going to attempt to answer your third point (about K-means clustering) here. Search SO and I'm sure you will find some relevant questions/answers.
Regarding your other questions, have a careful look at the following. If you don't understand what a particular function is doing, refer to ?function_name (e.g. ?tapply), and for further enlightenment, run nested code from the inside out (e.g. for foo(bar(baz(x))), you could examine baz(x), then bar(baz(x)), and finally foo(bar(baz(x))). This is an easy way to help you get a handle on what's going on, and is also useful when debugging code that produces errors.
d <- read.csv(text='Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe', header=TRUE)
barplot(with(d, tapply(Country, Region, length)), cex.names=0.8,
ylab='No. of countries', xlab='Region', las=1)
boxplot(LifeExpectancy ~ Region, data=d, las=1,
xlab='Region', ylab='Life expectancy')
d$Country[which.min(d$LifeExpectancy)]
# [1] India
# Levels: Belgium Canada France Germany India Myanmar Srilanka Switzerland UK USA
d$Country[which.max(d$LifeExpectancy)]
# [1] Switzerland
# Levels: Belgium Canada France Germany India Myanmar Srilanka Switzerland UK USA
Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
regions outcome
1 Michigan, Eastern 10
2 Michigan, Western 11
3 Minnesota 17
4 Mississippi, Northern 12
5 Mississippi, Southern 12
6 Missouri, Eastern 17
7 Missouri, Western 13
A useful tool would aggregate each region and add, or take the mean or maximum, etc. of outcome by region and generate a new data frame for state. A sum, for example, would output this:
state outcome
1 Michigan 21
3 Minnesota 17
4 Mississippi 24
6 Missouri 30
The aggregate() function won't solve this problem. Is there something else in R that is built for this? It seems like grep could be used to generate the new column "states" as part of an application specific program. Seems like this would already be out there somewhere though.
The reason this isn't straight forward is that the structure of your data is not consistent, so you couldn't build a library simply for it.
Your state, region column is basically an index column, and you want to index across part of it. tapply is designed for this, but there's no reason to build in a function to do it automatically for this specific scenario. You could do it without creating the column though
tapply(outcome,gsub(",.*$","",testset$regions),sum)
The index column just replaces the , and everything after it, leaving the index column.
PS: you have a slight typo in your example, your data.frame should be
testset <- data.frame(regions,outcome)