stuck making a data frame after using street2coordinates (R) - r

I am trying to follow the tutorial outlined here, but I am running into a problem at this step:
my_crime <- data.frame(year=my_crime$Year, community=my_crime$Community.Area,
type=my_crime$Primary.Type, arrest=my_crime$Arrest,
latitude=my_crime$Latitude, longitude=my_crime$Longitude)
My equivalent step is:
geocode <- data.frame(latitude=geocode$lat, longitude=geocode$long)
I get the following error:
Error in geocode$lat : $ operator is invalid for atomic vectors
I made the geocode dataset by sending a list of addresses to the street2coordinates website and getting back a list of long/lats (as outlined here). It seems that something is wrong with the dataset that comes out of that step. Here is the part where I make geocode:
data2 <- paste0("[",paste(paste0("\"",fm$V2,"\""),collapse=","),"]")
data2
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data2)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,lapply(json,
function(x) c(address=paste(x$street_address, x$locality, x$region), long=x$longitude,lat=x$latitude)))
geocode
Thank you for any and all help!
Results of str(geocode) after the first do.call (I altered the addresses):
chr [1:2, 1:3] "123 Main St Anytown MA" "669 Main St Anytown MA" "-65.5" "-33.4" "22.1" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] " 123 Main St Anytown MA" " 669 Main St Anytown MA"
..$ : chr [1:3] "address" "long" "lat"

Or you can use the RDSTK package and do the same thing:
library(RDSTK)
data <- c("1208 Buckingham Drive, Crystal Lake, IL 60014",
"9820 State Street East, Paducah, KY 42001",
"685 Park Place, Saint Petersburg, FL 33702",
"5316 4th Avenue, Charlotte, NC 28205",
"2994 Somerset Drive, Baldwinsville, NY 13027",
"5457 5th Street South, Tallahassee, FL 32303")
geocode <- do.call(rbind, lapply(data, street2coordinates))
geocode
## full.address country_code3 latitude
## 1 1208 Buckingham Drive, Crystal Lake, IL 60014 USA 42.21893
## 2 9820 State Street East, Paducah, KY 42001 USA 36.50045
## 3 685 Park Place, Saint Petersburg, FL 33702 USA 27.96470
## 4 5316 4th Avenue, Charlotte, NC 28205 USA 35.22241
## 5 2994 Somerset Drive, Baldwinsville, NY 13027 USA 42.94575
## 6 5457 5th Street South, Tallahassee, FL 32303 USA 30.45489
## country_name longitude street_address region confidence
## 1 United States -88.33914 474 Buckingham Dr IL 0.805
## 2 United States -88.32971 498 State St KY 0.551
## 3 United States -82.79733 685 Park St FL 0.721
## 4 United States -80.80540 1698 Firth Ct NC 0.512
## 5 United States -76.56455 98 Somerset Ave NY 0.537
## 6 United States -84.29354 699 W 5th Ave FL 0.610
## street_number locality street_name fips_county country_code
## 1 474 Crystal Lake Buckingham Dr 17111 US
## 2 498 Hazel State St 21035 US
## 3 685 Clearwater Park St 12103 US
## 4 1698 Charlotte Firth Ct 37119 US
## 5 98 Auburn Somerset Ave 36011 US
## 6 699 Tallahassee W 5th Ave 12073 US

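Currently, your do.call creates a character matrix (through rbind and c), which coerces all the numeric values to character. That is also why geocode$lat fails: a matrix is an atomic vector, so $ does not work on it.
The following should turn your list "json" into the data.frame "geocode" with the information you need, i.e. "address", "long" and "lat". Note the use of lapply rather than sapply, so that rbind receives a plain list of data frames.
foo <- function(x) data.frame(address=paste(x$street_address, x$locality,
x$region), long=x$longitude, lat=x$latitude, stringsAsFactors=FALSE)
geocode <- do.call(rbind, lapply(json, foo))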
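To make the coercion concrete, here is a minimal sketch with a hand-made stand-in for the parsed JSON. The list contents are invented for illustration (they mimic the altered addresses in the question); only the field names come from the question's code.

```r
# Hypothetical stand-in for fromJSON(content(response, type = "text"))
json <- list(
  "123 Main St" = list(street_address = "123 Main St", locality = "Anytown",
                       region = "MA", longitude = -65.5, latitude = 22.1),
  "669 Main St" = list(street_address = "669 Main St", locality = "Anytown",
                       region = "MA", longitude = -33.4, latitude = 10.2)
)

# c() mixes character and numeric, so every element becomes character
# and rbind() stacks the vectors into a character matrix.
as_matrix <- do.call(rbind, lapply(json, function(x)
  c(address = paste(x$street_address, x$locality, x$region),
    long = x$longitude, lat = x$latitude)))

# $ on a matrix raises an error; capture it instead of stopping.
err <- tryCatch(as_matrix$lat, error = function(e) e)

# One data.frame per element keeps each column's own type.
as_df <- do.call(rbind, lapply(json, function(x)
  data.frame(address = paste(x$street_address, x$locality, x$region),
             long = x$longitude, lat = x$latitude,
             stringsAsFactors = FALSE)))

str(as_df)
```

With the data frame version, as_df$lat and as_df$long stay numeric, so the data.frame(latitude=..., longitude=...) step from the tutorial should work unchanged.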

Related

Need to ID states from mixed names /IDs in location data

Need to ID states from mixed location data
Need to search for 50 states abbreviations & 50 states full names, and return state abbreviation
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL" , "Houston, TX",
"Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF" )
df <- data.frame(N, Loc)
# Objective: create a variable state such that
# state contains abbreviated names of states from Loc:
# for "Los Angeles, CA", state = CA
# for "Florida, USA", state = FL
# for "WV NY NJ", state = NA
# for "qwerty NJuy PO DOPL JKF", state = NA (in spite of containing the string NJ, it is not wrapped in spaces)
# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX","TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
> Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package for this, or can a loop be written? Even if the approach could be demonstrated with a few states, that would be sufficient; I will post the full solution when I get to it. By the way, this is for a Twitter dataset downloaded using the rtweet package, and the variable is place_full_name.
There are default constants in R, state.abb and state.name which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b',c(state.abb, state.name),
'\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do :
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, we can see that in the 9th row you expect NA, but this returns "WV" because it is a valid state abbreviation. In such cases, you need rules strict enough to extract only genuine state references and nothing else.
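One possible stricter rule, sketched in base R: collect every state name or abbreviation found in a string, convert names to abbreviations, and keep the result only when exactly one distinct state remains. The function name extract_state is made up for this example, and the "exactly one state" rule is an assumption about what counts as unambiguous.

```r
extract_state <- function(loc) {
  loc <- as.character(loc)
  # Whole-word match on any built-in state name or abbreviation
  pattern <- paste0("\\b(", paste(c(state.name, state.abb), collapse = "|"), ")\\b")
  hits <- regmatches(loc, gregexpr(pattern, loc, perl = TRUE))
  vapply(hits, function(h) {
    # Translate full names to abbreviations so "Texas" and "TX" agree
    is_name <- h %in% state.name
    h[is_name] <- state.abb[match(h[is_name], state.name)]
    h <- unique(h)
    # Zero hits or several different states: treat as ambiguous
    if (length(h) == 1) h else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL",
         "Houston, TX", "Texas, USA", "Corona, CA", "Georgia, USA",
         "WV NY NJ", "qwerty uy PO DOPL JKF")
state <- extract_state(Loc)
state
```

Uppercase English words that happen to equal an abbreviation (for example "OK" or "IN") would still be picked up, so real tweet data may need further rules.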
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
If an extracted abbreviation is not one of the built-in state.abb values, we use match to look up the extracted name in the built-in state.name vector and use that position to index state.abb; otherwise we keep the abbreviation we already have. Entries that match neither return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

How to have bar labels be names in Plotly for R

So I'm trying to make a bar chart that displays the most popular airports that flew to Chicago. For some reason, I'm finding it to be extremely difficult to have my bars be labeled by the airport names specifically.
I have a data frame called ty
> ty
Name
1 Atlanta, GA: Hartsfield-Jackson Atlanta International
2 New York, NY: LaGuardia
3 Minneapolis, MN: Minneapolis-St Paul International
4 Los Angeles, CA: Los Angeles International
5 Denver, CO: Denver International
6 Washington, DC: Ronald Reagan Washington National
7 Orlando, FL: Orlando International
8 Phoenix, AZ: Phoenix Sky Harbor International
9 Detroit, MI: Detroit Metro Wayne County
10 Las Vegas, NV: McCarran International
11 San Francisco, CA: San Francisco International
12 Dallas/Fort Worth, TX: Dallas/Fort Worth International
13 Boston, MA: Logan International
14 Philadelphia, PA: Philadelphia International
15 Newark, NJ: Newark Liberty International
I also have a data frame called df
id numArrivals
1 10397 964
2 12953 962
3 13487 883
4 12892 823
5 11292 776
6 11278 771
7 13204 725
8 14107 700
9 11433 672
10 12889 647
11 14771 611
12 11298 580
13 10721 569
14 14100 567
15 11618 488
The ids correspond to the airport names in the same order: 10397 is Atlanta, GA: Hartsfield-Jackson Atlanta International, and so on.
However, when I run:
plotly::plot_ly(df,x=ty["Name"],y=df$numArrivals,type="bar",color=I("rgba(0,92,124,1)"))
I am given this chart.
How can I make the labels of my bars into the names of the airport rather than just numbers?
Feel free to use ggplotly() to create your plot. I used the code below to create a small example.
example <- data.frame(airport = c("Atlanta, GA: Hartsfield-Jackson Atlanta International","New York, NY: LaGuardia","Minneapolis, MN: Minneapolis-St Paul International"),
id = c(10397,12953,13487),
numArrivals = c(964,962,883),stringsAsFactors = F)
library(ggplot2)
library(plotly)
a <- ggplot(example,aes(x=airport,y=numArrivals,fill=id)) + geom_bar(stat = "identity") + coord_flip()
ggplotly(a)
The final result looks like this.
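If you would rather stay with plot_ly() itself, a sketch along these lines may work. The likely cause of the numeric labels is that ty["Name"] is a one-column data frame rather than a vector; combining the two tables and referring to columns with formulas avoids that. Only the first three rows of the question's data are reproduced here, and the plotting part runs only if plotly is installed.

```r
ty <- data.frame(
  Name = c("Atlanta, GA: Hartsfield-Jackson Atlanta International",
           "New York, NY: LaGuardia",
           "Minneapolis, MN: Minneapolis-St Paul International"),
  stringsAsFactors = FALSE
)
df <- data.frame(id = c(10397, 12953, 13487),
                 numArrivals = c(964, 962, 883))

# The rows are in the same order, so the tables can be bound side by side.
airports <- cbind(ty, df)

# A factor with explicit levels keeps the bars sorted by arrivals
# instead of alphabetically.
airports$Name <- factor(airports$Name,
                        levels = airports$Name[order(-airports$numArrivals)])

if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(airports, x = ~Name, y = ~numArrivals, type = "bar")
  if (interactive()) print(p)
}
```

Passing x = ty$Name (a character vector) in the original call should have a similar effect.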

Convert one column into multiple columns

I am a novice. I have a data set with one column and many rows. I want to convert this column into 5 columns. For example my data set looks like this:
Column
----
City
Nation
Area
Metro Area
Urban Area
Shanghai
China
24,000,000
1230040
4244234
New york
America
343423
23423434
343434
Etc
The output should look like this
City | Nation | Area | Metro Area | Urban Area
----- ------- ------ ------------ -----------
Shanghai China 24000000 1230040 4244234
New york America 343423 23423434 343434
The first 5 rows of the data set (City, Nation, Area, etc.) need to be the names of the 5 columns, and I want the rest of the data to be populated under these 5 columns. Please help.
Here is a one liner (considering that your column is character, i.e. df$column <- as.character(df$column))
setNames(data.frame(matrix(unlist(df[-c(1:5),]), ncol = 5, byrow = TRUE)), c(unlist(df[1:5,])))
# City Nation Area Metro_Area Urban_Area
#1 Shanghai China 24,000,000 1230040 4244234
#2 New_york America 343423 23423434 343434
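The same reshaping, broken into steps on the sample data from the question, might look like the sketch below. The column name "column" and the choice of which fields to convert to numbers are assumptions.

```r
df <- data.frame(
  column = c("City", "Nation", "Area", "Metro Area", "Urban Area",
             "Shanghai", "China", "24,000,000", "1230040", "4244234",
             "New york", "America", "343423", "23423434", "343434"),
  stringsAsFactors = FALSE
)

header <- df$column[1:5]      # first five rows become the column names
body   <- df$column[-(1:5)]   # everything else is data

# Fill a 5-column matrix row by row, then convert to a data frame.
wide <- as.data.frame(matrix(body, ncol = 5, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- header

# Strip thousands separators and convert the numeric fields.
num_cols <- c("Area", "Metro Area", "Urban Area")
wide[num_cols] <- lapply(wide[num_cols],
                         function(x) as.numeric(gsub(",", "", x)))
wide
```

This assumes the number of data rows is an exact multiple of 5; matrix() recycles values with a warning otherwise.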
I'm going to go out on a limb and guess that the data you're after is from the URL: https://en.wikipedia.org/wiki/List_of_largest_cities.
If this is the case, I would suggest you actually try re-reading the data (not sure how you got the data into R in the first place) since that would probably make your life easier.
Here's one way to read the data in:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/List_of_largest_cities"
XPATH <- '//*[@id="mw-content-text"]/table[2]'
cities <- URL %>%
read_html() %>%
html_nodes(xpath=XPATH) %>%
html_table(fill = TRUE)
Here's what the data currently looks like. It still needs to be cleaned up (notice that some of the column names came from merged "rowspan" cells and the like):
head(cities[[1]])
## City Nation Image Population Population Population
## 1 Image City proper Metropolitan area Urban area[7]
## 2 Shanghai China 24,256,800[8] 34,750,000[9] 23,416,000[a]
## 3 Karachi Pakistan 23,500,000[10] 25,400,000[11] 25,400,000
## 4 Beijing China 21,516,000[12] 24,900,000[13] 21,009,000
## 5 Dhaka Bangladesh 16,970,105[14] 15,669,000 18,305,671[15][not in citation given]
## 6 Delhi India 16,787,941[16] 24,998,000 21,753,486[17]
From there, the cleanup might be like:
cities <- cities[[1]][-1, ]
names(cities) <- c("City", "Nation", "Image", "Pop_City", "Pop_Metro", "Pop_Urban")
cities["Image"] <- NULL
head(cities)
cities[] <- lapply(cities, function(x) type.convert(gsub("\\[.*|,", "", x)))
head(cities)
# City Nation Pop_City Pop_Metro Pop_Urban
# 2 Shanghai China 24256800 34750000 23416000
# 3 Karachi Pakistan 23500000 25400000 25400000
# 4 Beijing China 21516000 24900000 21009000
# 5 Dhaka Bangladesh 16970105 15669000 18305671
# 6 Delhi India 16787941 24998000 21753486
# 7 Lagos Nigeria 16060303 13123000 21000000
str(cities)
# 'data.frame': 163 obs. of 5 variables:
# $ City : Factor w/ 162 levels "Abidjan","Addis Ababa",..: 133 74 12 41 40 84 66 148 53 102 ...
# $ Nation : Factor w/ 59 levels "Afghanistan",..: 13 41 13 7 25 40 54 31 13 25 ...
# $ Pop_City : num 24256800 23500000 21516000 16970105 16787941 ...
# $ Pop_Metro: int 34750000 25400000 24900000 15669000 24998000 13123000 13520000 37843000 44259000 17712000 ...
# $ Pop_Urban: num 23416000 25400000 21009000 18305671 21753486 ...
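For reference, this is what the regular expression in the cleanup step does on its own. The sample strings are copied from the table output above; the pattern removes everything from the first "[" onward as well as every comma.

```r
raw <- c("24,256,800[8]", "34,750,000[9]",
         "18,305,671[15][not in citation given]", "25,400,000")

# "\\[.*" drops footnote markers and anything after them;
# "," drops the thousands separators.
clean <- gsub("\\[.*|,", "", raw)
clean

# The cleaned strings now convert to numbers without producing NA.
pop <- as.numeric(clean)
pop
```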

Memory Issues using rnoaa package

I'm working with the rnoaa package to add US station IDs to a df of weather events. Below is str() for the rain df.
google drive link to csv file of subset
'data.frame': 4395 obs. of 63 variables:
$ YEAR : int 2009 2009 2012 2013 2013 2015 2007 2007 2007
$ msa_code : int 29180 29180 29180 12260 12260 12260 23540 23540
$ zip : int 22001 22001 22001 45003 45003 45003 12001 12001
$ state : chr "LA" "LA" "LA" "SC" ...
$ gdp : int 23495 23495 27346 20856 20856 22313 10119 10119
$ EVENT_TYPE : chr "Heavy Rain" "Heavy Rain" "Heavy Rain" "Heavy
$ WFO : chr "LCH" "LCH" "LCH" "CAE" ...
$ latitude : num 30.4 30.2 30.2 33.4 33.5 ...
$ longitude : num -92.4 -92.4 -92.2 -81.6 -81.9 ...
$ SUM_DAMAGES : num 0 0 0 0 0 0 0 0 0 0 ...
Omitting a bunch of variables that aren't relevant to this, here is a snippet of the rain df
X CZ_NAME YEAR full state name msa_code msa_name.x zip
49 ACADIA 2009 LOUISIANA 29180 Lafayette, LA 22001
60 ACADIA 2009 LOUISIANA 29180 Lafayette, LA 22001
91 ACADIA 2012 LOUISIANA 29180 Lafayette, LA 22001
761 AIKEN 2013 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
770 AIKEN 2013 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
809 AIKEN 2015 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
latitude longitude
-92.4200 30.4300
-92.3700 30.2200
-92.2484 30.2354
-81.6400 33.4361
-81.8800 33.5400
-81.7000 33.5300
Here is a snippet of the ghcnd_stations() tibble, which the rnoaa documentation recommends assigning so it doesn't have to call it each time.
# A tibble: 6 × 11
id latitude longitude elevation state name
<chr> <dbl> <dbl> <dbl> <chr> <chr>
1 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
2 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
3 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
4 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
5 US10adam001 40.5680 -98.5069 598 NE JUNIATA 1.5 S
6 US10adam001 40.5680 -98.5069 598 NE JUNIATA 1.5 S
# ... with 5 more variables: gsn_flag <chr>, wmo_id <chr>, element <chr>,
# first_year <int>, last_year <int>
So far I've been able to use the ghcnd_stations() command to call up a list of stations. After removing non-CONUS stations and taking the lat/lon coordinates of the remaining ones, I use fuzzyjoin::geo_inner_join to compare the two lists and merge in the closest stations.
subset <- head(rain)
subset_join <- geo_inner_join(subset, stations, by = c("latitude", "longitude"), max_dist = 5)
I took a subset of my data and tried to run this and it works, but when I try to run that code on the entire dataset I'm faced with memory.size errors:
Error: cannot allocate vector of size 2.9 Gb
In addition: Warning messages:
1: In fuzzy_join(x, y, multi_by = by, multi_match_fun = match_fun, :
Reached total allocation of 8017Mb: see help(memory.size)
I've tried using memory.size = 9000 and read up on increasing the memory limit, but I'm still receiving an error. memory.size(max = TRUE) returns this:
> memory.size(max = TRUE)
[1] 7013
Is there a more efficient way to do this, or am I going to have to slice up my df, run the code, and then rbind it back together?
Just for context, here is Sys.info():
Sys.info()
sysname release version nodename
"Windows" ">= 8 x64" "build 9200" "DESKTOP-G88LPOJ"
machine login user effective_user
"x86-64" "franc" "franc" "franc"
First question! Let me know if I haven't included anything relevant.
Thanks!
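The slice-and-rbind idea raised in the question could be sketched in base R roughly as follows. This is not fuzzyjoin's method: it swaps in a plain haversine distance and keeps only the single nearest station per event. The names haversine_km, nearest_station and chunk_size are made up for this example, the station "HYPO_LA" is invented test data, and distances are in kilometres.

```r
# Great-circle distance in km between vectors of coordinates (degrees)
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

nearest_station <- function(events, stations, chunk_size = 50) {
  # ghcnd_stations() repeats each station once per element; keep one row per id
  stations <- stations[!duplicated(stations$id), ]
  n <- nrow(events)
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  pieces <- lapply(chunks, function(idx) {
    ev <- events[idx, , drop = FALSE]
    # Distance matrix for this chunk only: rows = events, columns = stations
    d <- outer(seq_len(nrow(ev)), seq_len(nrow(stations)), function(i, j)
      haversine_km(ev$latitude[i], ev$longitude[i],
                   stations$latitude[j], stations$longitude[j]))
    best <- apply(d, 1, which.min)
    ev$station_id <- stations$id[best]
    ev$station_km <- d[cbind(seq_len(nrow(ev)), best)]
    ev
  })
  do.call(rbind, unname(pieces))
}

# Toy data: two events from the question plus three stations
events <- data.frame(CZ_NAME = c("ACADIA", "AIKEN"),
                     latitude = c(30.43, 33.4361),
                     longitude = c(-92.42, -81.64))
stations <- data.frame(id = c("US009052008", "US10adam001", "HYPO_LA"),
                       latitude = c(43.7333, 40.5680, 30.2),
                       longitude = c(-96.6333, -98.5069, -92.0),
                       stringsAsFactors = FALSE)
joined <- nearest_station(events, stations, chunk_size = 1)
joined
```

Peak memory is governed by chunk_size times the number of unique stations, so a smaller chunk_size trades speed for a smaller distance matrix.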

Convert a list into a dataframe with <0 rows> (or 0-length row.names)

I am using the RDSTK package to convert addresses to lat/lon. I want to convert the following list into a dataframe, but I am not sure how to deal with the third element I got. Here is the list:
[[1]]
full.address country_code3 latitude country_name longitude street_address region
1 25462 Alabama Hwy. 127 35620 Elkmont AL 35620 USA 34.92968 United States -86.98871 25462 State Rte 127 AL
confidence street_number locality street_name fips_county country_code
1 0.791 25462 Elkmont State Rte 127 01083 US
[[2]]
full.address country_code3 latitude country_name longitude street_address region
1 270 Industrial Blvd. 35982 Leesburg AL 35982 USA 33.99676 United States -86.11737 270 Industrial Blvd SE AL
confidence street_number locality street_name fips_county country_code
1 0.678 270 Attalla Industrial Blvd SE 01055 US
[[3]]
[1] full.address
<0 rows> (or 0-length row.names)
[[4]]
full.address country_code3 latitude country_name longitude street_address region confidence
1 934 Adams Avenue 36104 Montgomery AL 36104 USA 32.37545 United States -86.29605 934 Adams Ave AL 0.883
street_number locality street_name fips_county country_code
1 934 Montgomery Adams Ave 01101 US
[[5]]
full.address country_code3 latitude country_name longitude street_address region confidence
1 8189 Vaughn Road 36116 Montgomery AL 36116 USA 32.33882 United States -86.17086 8189 Vaughn Rd AL 0.883
street_number locality street_name fips_county country_code
1 8189 Montgomery Vaughn Rd 01101 US
The third element appears as <0 rows> (or 0-length row.names).
What I want to achieve is
full.address country_code3 latitude country_name longitude street_address region
1 25462 Alabama Hwy. 127 35620 Elkmont AL 35620 USA 34.92968 United States -86.98871 25462 State Rte 127 AL
2 270 Industrial Blvd. 35982 Leesburg AL 35982 USA 33.99676 United States -86.11737 270 Industrial Blvd SE AL
3 NA NA NA
4 934 Adams Avenue 36104 Montgomery AL 36104 USA 32.37545 United States -86.29605 934 Adams Ave AL
5 8189 Vaughn Road 36116 Montgomery AL 36116 USA 32.33882 United States -86.17086 8189 Vaughn Rd AL
confidence street_number locality street_name fips_county country_code
1 0.791 25462 Elkmont State Rte 127 01083 US
2 0.678 270 Attalla Industrial Blvd SE 01055 US
3 NA NA NA NA NA NA
4 0.883 934 Montgomery Adams Ave 01101 US
5 0.883 8189 Montgomery Vaughn Rd 01101 US
Here is what I got with dput:
> dput(geocode[[3]])
structure(list(full.address = character(0)), .Names = "full.address", row.names = integer(0), class = "data.frame")
The problem is that rbind ignores any empty data.frames and doesn't add them.
So, we can change your data so that what was empty is now NA:
geocode <- lapply(geocode, function(x) if(nrow(x)==0) NA else x)
Then we can use rbind:
do.call(rbind, geocode)
full.address blah
1 <NA> <NA>
2 a b
data used:
list(structure(list(full.address = character(0)), .Names = "full.address", row.names = integer(0), class = "data.frame"),
structure(list(full.address = structure(1L, .Label = "a", class = "factor"),
blah = structure(1L, .Label = "b", class = "factor")), .Names = c("full.address",
"blah"), row.names = c(NA, -1L), class = "data.frame"))

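A variation on the same idea, sketched with toy data: replace each empty element with a one-row data frame of NA that carries every column name, so the result does not depend on how rbind treats a bare NA. The list below is an invented stand-in for the geocoder output, and pad_empty is a hypothetical helper name. It assumes every non-empty element has the full set of columns.

```r
# Toy stand-in for the geocoder results; the second element is empty.
results <- list(
  data.frame(full.address = "a", latitude = 1.5, stringsAsFactors = FALSE),
  data.frame(full.address = character(0), stringsAsFactors = FALSE),
  data.frame(full.address = "c", latitude = 3.5, stringsAsFactors = FALSE)
)

# Every column name seen in any element
all_cols <- unique(unlist(lapply(results, names)))

pad_empty <- function(x, cols) {
  if (nrow(x) > 0) return(x[cols])
  # One row of NA with the complete set of columns
  as.data.frame(setNames(rep(list(NA), length(cols)), cols))
}

out <- do.call(rbind, lapply(results, pad_empty, cols = all_cols))
out
```

The failed lookup then stays in position as a row of NA, which keeps the result aligned with the original list of addresses.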
Resources