Memory Issues using rnoaa package - r

I'm working with the rnoaa package to add US station IDs to a df of weather events. Below is str() for the rain df.
google drive link to csv file of subset
'data.frame': 4395 obs. of 63 variables:
$ YEAR : int 2009 2009 2012 2013 2013 2015 2007 2007 2007
$ msa_code : int 29180 29180 29180 12260 12260 12260 23540 23540
$ zip : int 22001 22001 22001 45003 45003 45003 12001 12001
$ state : chr "LA" "LA" "LA" "SC" ...
$ gdp : int 23495 23495 27346 20856 20856 22313 10119 10119
$ EVENT_TYPE : chr "Heavy Rain" "Heavy Rain" "Heavy Rain" "Heavy
$ WFO : chr "LCH" "LCH" "LCH" "CAE" ...
$ latitude : num 30.4 30.2 30.2 33.4 33.5 ...
$ longitude : num -92.4 -92.4 -92.2 -81.6 -81.9 ...
$ SUM_DAMAGES : num 0 0 0 0 0 0 0 0 0 0 ...
Omitting a bunch of variables that aren't relevant to this, here is a snippet of the rain df
X CZ_NAME YEAR full state name msa_code msa_name.x zip
49 ACADIA 2009 LOUISIANA 29180 Lafayette, LA 22001
60 ACADIA 2009 LOUISIANA 29180 Lafayette, LA 22001
91 ACADIA 2012 LOUISIANA 29180 Lafayette, LA 22001
761 AIKEN 2013 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
770 AIKEN 2013 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
809 AIKEN 2015 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
latitude longitude
-92.4200 30.4300
-92.3700 30.2200
-92.2484 30.2354
-81.6400 33.4361
-81.8800 33.5400
-81.7000 33.5300
Here is a snippet of the ghcnd_stations() tibble, which the rnoaa documentation recommends assigning so it doesn't have to call it each time.
# A tibble: 6 × 11
id latitude longitude elevation state name
<chr> <dbl> <dbl> <dbl> <chr> <chr>
1 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
2 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
3 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
4 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
5 US10adam001 40.5680 -98.5069 598 NE JUNIATA 1.5 S
6 US10adam001 40.5680 -98.5069 598 NE JUNIATA 1.5 S
# ... with 5 more variables: gsn_flag <chr>, wmo_id <chr>, element <chr>,
# first_year <int>, last_year <int>
So far I've been able to use ghcnd_stations() to pull the list of stations and remove the non-CONUS ones. I then use fuzzyjoin::geo_inner_join to compare the stations' lat/lon coordinates with those in my df and merge in the closest stations.
subset <- head(rain)
subset_join <- geo_inner_join(subset, stations, by = c("latitude", "longitude"), max_dist = 5)
I took a subset of my data and tried to run this and it works, but when I try to run that code on the entire dataset I'm faced with memory.size errors:
Error: cannot allocate vector of size 2.9 Gb
In addition: Warning messages:
1: In fuzzy_join(x, y, multi_by = by, multi_match_fun = match_fun, :
Reached total allocation of 8017Mb: see help(memory.size)
I've tried using memory.size = 9000, and tried to read up on increasing the memory size, but I'm still receiving an error. memory.size(max = TRUE) returns this:
> memory.size(max = TRUE)
[1] 7013
Is there a more efficient way to do this, or am I going to have to slice up my df, run the code, and then rbind it back together?
Just for context, here is Sys.info():
Sys.info()
sysname release version nodename
"Windows" ">= 8 x64" "build 9200" "DESKTOP-G88LPOJ"
machine login user effective_user
"x86-64" "franc" "franc" "franc"
First question! Let me know if I haven't included anything relevant.
Thanks!
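One possible way around the allocation error, sketched under assumptions rather than taken from the post: the ghcnd_stations() snippet above repeats each station id once per element (US009052008 appears four times), so reducing the station table to one row per id shrinks the comparison considerably. The sketch below then looks up the single nearest station for each event, one event at a time, in base R, instead of building many event/station pairs at once, which appears to be what exhausts memory in the fuzzy join. This is a nearest-neighbour lookup, not the same operation as geo_inner_join with max_dist = 5. The toy data, the station ids and the helper haversine_miles() are hypothetical.

```r
# Hypothetical toy data standing in for the real `rain` and `stations` objects.
rain <- data.frame(
  latitude  = c(30.43, 30.22, 33.44, 33.54),
  longitude = c(-92.42, -92.37, -81.64, -81.88)
)
stations <- data.frame(
  id        = c("ST_A", "ST_A", "ST_B", "ST_C"),
  latitude  = c(30.40, 30.40, 33.50, 43.73),
  longitude = c(-92.40, -92.40, -81.70, -96.63),
  stringsAsFactors = FALSE
)

# Keep one row per station; the raw table has one row per station and element.
stations_unique <- unique(stations[, c("id", "latitude", "longitude")])

# Great-circle (haversine) distance in miles; vectorised over its arguments.
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  3958.8 * 2 * asin(pmin(1, sqrt(a)))
}

# For each event, compute distances to all stations and keep the closest index.
# Only one distance vector (length = number of stations) exists at a time.
nearest <- vapply(seq_len(nrow(rain)), function(i) {
  which.min(haversine_miles(rain$latitude[i], rain$longitude[i],
                            stations_unique$latitude,
                            stations_unique$longitude))
}, integer(1))

rain$station_id <- stations_unique$id[nearest]
rain$dist_miles <- haversine_miles(rain$latitude, rain$longitude,
                                   stations_unique$latitude[nearest],
                                   stations_unique$longitude[nearest])
```

If a distance cutoff is still wanted, rows can be filtered afterwards on dist_miles.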

Related

How to create a data group (factor variables) in my dataframe based on categorical variables #R

I want to create a factor variable in my dataframe based on a categorical variable.
My data:
# A tibble: 159 x 3
name.country gpd rate_suicide
<chr> <dbl> <dbl>
1 Afghanistan 2129. 6.4
2 Albania 12003. 5.6
3 Algeria 11624. 3.3
4 Angola 7103. 8.9
5 Antigua and Barbuda 19919. 0.5
6 Argentina 20308. 9.1
7 Armenia 10704. 5.7
8 Australia 47350. 11.7
9 Austria 52633. 11.4
10 Azerbaijan 14371. 2.6
# ... with 149 more rows
I want to create a factor variable region, which contains these levels:
region <- c('Asian', 'Europe', 'South America', 'North America', 'Africa')
region = factor(region, levels = c('Asian', 'Europe', 'South America', 'North America', 'Africa'))
I want to do this with the dplyr package, choosing the factor level depending on name.country, but it doesn't work. Example:
if (new_data$name.country[new_data$name.country == "N"]) {
mutate(new_data, region_ = region[1])
}
How can I solve the problem?
Here is how I would approach your problem:
Create a reproducible problem (see How to make a great R reproducible example). Since you already have the data, use dput to make it easier for people like me to recreate your data in their environment.
dput(yourdf)
structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
raw_data<-structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
Define vectors that specify your regions
Use case_when to separate countries into regions
Use as.factor to convert your character variable to a factor
asia=c("Afghanistan","India","...","Rest of countries in Asia")
europe=c("Albania","France","...","Rest of countries in Europe")
africa=c("Algeria","Egypt","...","Rest of countries in Africa")
library(dplyr) # provides %>%, mutate() and case_when()
df<-raw_data %>%
mutate(region=case_when(
name.country %in% asia ~ "asia",
name.country %in% europe ~ "europe",
name.country %in% africa ~ "africa",
TRUE ~ "other"
)) %>%
mutate(region=region %>% as.factor())
You can check that your variable region is a factor using str
str(df)
'data.frame': 3 obs. of 4 variables:
$ name.country: chr "Afghanistan" "Albania" "Algeria"
$ gpd : int 2129 12003 11624
$ rate_suicide: num 6.4 5.6 3.3
$ region : Factor w/ 3 levels "africa","asia",..: 2 3 1
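The str() output above shows the levels in alphabetical order, which is what as.factor does by default. If a specific level order matters, as in the question's factor(region, levels = ...) call, the levels can be passed explicitly. A minimal sketch; the vector region_chr and the chosen order are made up for illustration:

```r
# Character values as they would come out of case_when()
region_chr <- c("asia", "europe", "africa")

# Explicit level order instead of the alphabetical default
region_fct <- factor(region_chr,
                     levels = c("asia", "europe", "africa", "other"))

levels(region_fct)      # order follows the `levels` argument
as.integer(region_fct)  # integer codes follow that order too
```

Levels listed but not present in the data (here "other") are kept as empty levels.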
Here is a working example that combines data from the question with a file of countries and region information from Github. H/T to Luke Duncalfe for maintaining the region data, which is:
...a combination of the Wikipedia ISO-3166 article for alpha and numeric country codes and the UN Statistics site for countries' regional and sub-regional codes.
regionFile <- "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
regionData <- read.csv(regionFile,header=TRUE)
textFile <- "rowID|country|gdp|suicideRate
1|Afghanistan|2129.|6.4
2|Albania|12003.|5.6
3|Algeria|11624.|3.3
4|Angola|7103.|8.9
5|Antigua and Barbuda|19919.|0.5
6|Argentina|20308.|9.1
7|Armenia|10704.|5.7
8|Australia|47350.|11.7
9|Austria|52633.|11.4
10|Azerbaijan|14371.|2.6"
data <- read.csv(text=textFile,sep="|")
library(dplyr)
data %>%
left_join(.,regionData,by = c("country" = "name"))
...and the output:
rowID country gdp suicideRate alpha.2 alpha.3 country.code
1 1 Afghanistan 2129 6.4 AF AFG 4
2 2 Albania 12003 5.6 AL ALB 8
3 3 Algeria 11624 3.3 DZ DZA 12
4 4 Angola 7103 8.9 AO AGO 24
5 5 Antigua and Barbuda 19919 0.5 AG ATG 28
6 6 Argentina 20308 9.1 AR ARG 32
7 7 Armenia 10704 5.7 AM ARM 51
8 8 Australia 47350 11.7 AU AUS 36
9 9 Austria 52633 11.4 AT AUT 40
10 10 Azerbaijan 14371 2.6 AZ AZE 31
iso_3166.2 region sub.region intermediate.region
1 ISO 3166-2:AF Asia Southern Asia
2 ISO 3166-2:AL Europe Southern Europe
3 ISO 3166-2:DZ Africa Northern Africa
4 ISO 3166-2:AO Africa Sub-Saharan Africa Middle Africa
5 ISO 3166-2:AG Americas Latin America and the Caribbean Caribbean
6 ISO 3166-2:AR Americas Latin America and the Caribbean South America
7 ISO 3166-2:AM Asia Western Asia
8 ISO 3166-2:AU Oceania Australia and New Zealand
9 ISO 3166-2:AT Europe Western Europe
10 ISO 3166-2:AZ Asia Western Asia
region.code sub.region.code intermediate.region.code
1 142 34 NA
2 150 39 NA
3 2 15 NA
4 2 202 17
5 19 419 29
6 19 419 5
7 142 145 NA
8 9 53 NA
9 150 155 NA
10 142 145 NA
At this point one can decide whether to use the region, sub region, or intermediate region and convert it to a factor.
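One thing worth checking before converting to a factor: a left join keeps every row of the left table, and any country whose name does not match the lookup exactly ends up with NA in the region columns. The sketch below uses base R merge(all.x = TRUE) as a stand-in for left_join, with a tiny made-up lookup table and a deliberately unmatched name ("Atlantis"), to show how such rows can be listed:

```r
# Hypothetical miniature versions of the two tables
data <- data.frame(country = c("Albania", "Algeria", "Atlantis"),
                   stringsAsFactors = FALSE)
regionData <- data.frame(name   = c("Albania", "Algeria"),
                         region = c("Europe", "Africa"),
                         stringsAsFactors = FALSE)

# all.x = TRUE keeps unmatched rows from `data`, like a left join
merged <- merge(data, regionData,
                by.x = "country", by.y = "name", all.x = TRUE)

# Countries that found no region in the lookup
unmatched <- merged$country[is.na(merged$region)]
unmatched
```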
We can set region to a factor by adding a mutate() function to the dplyr pipeline:
data %>%
left_join(.,regionData,by = c("country" = "name")) %>%
mutate(region = factor(region)) -> mergedData
At this point mergedData$region is a factor.
str(mergedData$region)
table(mergedData$region)
> str(mergedData$region)
Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 2 3 5 4 3
> table(mergedData$region)
Africa Americas Asia Europe Oceania
2 2 3 2 1
Now the data is ready for further analysis. We will generate a table of average suicide rates by region.
library(knitr) # for kable
mergedData %>% group_by(region) %>%
summarise(suicideRate = mean(suicideRate)) %>%
kable(.)
...and the output:
|region | suicideRate|
|:--------|-----------:|
|Africa | 6.1|
|Americas | 4.8|
|Asia | 4.9|
|Europe | 8.5|
|Oceania | 11.7|

Convert one column into multiple columns

I am a novice. I have a data set with one column and many rows. I want to convert this column into 5 columns. For example my data set looks like this:
Column
----
City
Nation
Area
Metro Area
Urban Area
Shanghai
China
24,000,000
1230040
4244234
New york
America
343423
23423434
343434
Etc
The output should look like this
City | Nation | Area | Metro City | Urban Area
----- ------- ------ ------------ -----------
Shangai China 2400000 1230040 4244234
New york America 343423 23423434 343434
The first 5 rows of the data set (City, Nation, Area, etc.) need to be the names of the 5 columns, and I want the rest of the data to be populated under these 5 columns. Please help.
Here is a one liner (considering that your column is character, i.e. df$column <- as.character(df$column))
setNames(data.frame(matrix(unlist(df[-c(1:5),]), ncol = 5, byrow = TRUE)), c(unlist(df[1:5,])))
# City Nation Area Metro_Area Urban_Area
#1 Shanghai China 24,000,000 1230040 4244234
#2 New_york America 343423 23423434 343434
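To make the one-liner easier to follow, here is the same idea unpacked into steps on a toy version of the question's column: the first five values become the header, and the remaining values are laid out row by row in a five-column matrix. The data frame below is reconstructed from the question, and the intermediate names (header, values, wide) are only for illustration:

```r
df <- data.frame(
  Column = c("City", "Nation", "Area", "Metro Area", "Urban Area",
             "Shanghai", "China", "24,000,000", "1230040", "4244234",
             "New york", "America", "343423", "23423434", "343434"),
  stringsAsFactors = FALSE
)

header <- df$Column[1:5]     # first five entries are the column names
values <- df$Column[-(1:5)]  # everything else is data

# The reshape only makes sense if the data fill complete rows of five
stopifnot(length(values) %% 5 == 0)

# byrow = TRUE fills the matrix one record (five values) at a time
wide <- as.data.frame(matrix(values, ncol = 5, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- header
wide
```

All columns are still character at this point; numeric columns would need the commas removed before conversion.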
I'm going to go out on a limb and guess that the data you're after is from the URL: https://en.wikipedia.org/wiki/List_of_largest_cities.
If this is the case, I would suggest you actually try re-reading the data (not sure how you got the data into R in the first place) since that would probably make your life easier.
Here's one way to read the data in:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/List_of_largest_cities"
XPATH <- '//*[@id="mw-content-text"]/table[2]'
cities <- URL %>%
read_html() %>%
html_nodes(xpath=XPATH) %>%
html_table(fill = TRUE)
Here's what the data currently looks like. It still needs to be cleaned up (notice that some of the column names are off because the table header used merged cells, "rowspan" and the like):
head(cities[[1]])
## City Nation Image Population Population Population
## 1 Image City proper Metropolitan area Urban area[7]
## 2 Shanghai China 24,256,800[8] 34,750,000[9] 23,416,000[a]
## 3 Karachi Pakistan 23,500,000[10] 25,400,000[11] 25,400,000
## 4 Beijing China 21,516,000[12] 24,900,000[13] 21,009,000
## 5 Dhaka Bangladesh 16,970,105[14] 15,669,000 18,305,671[15][not in citation given]
## 6 Delhi India 16,787,941[16] 24,998,000 21,753,486[17]
From there, the cleanup might be like:
cities <- cities[[1]][-1, ]
names(cities) <- c("City", "Nation", "Image", "Pop_City", "Pop_Metro", "Pop_Urban")
cities["Image"] <- NULL
head(cities)
cities[] <- lapply(cities, function(x) type.convert(gsub("\\[.*|,", "", x)))
head(cities)
# City Nation Pop_City Pop_Metro Pop_Urban
# 2 Shanghai China 24256800 34750000 23416000
# 3 Karachi Pakistan 23500000 25400000 25400000
# 4 Beijing China 21516000 24900000 21009000
# 5 Dhaka Bangladesh 16970105 15669000 18305671
# 6 Delhi India 16787941 24998000 21753486
# 7 Lagos Nigeria 16060303 13123000 21000000
str(cities)
# 'data.frame': 163 obs. of 5 variables:
# $ City : Factor w/ 162 levels "Abidjan","Addis Ababa",..: 133 74 12 41 40 84 66 148 53 102 ...
# $ Nation : Factor w/ 59 levels "Afghanistan",..: 13 41 13 7 25 40 54 31 13 25 ...
# $ Pop_City : num 24256800 23500000 21516000 16970105 16787941 ...
# $ Pop_Metro: int 34750000 25400000 24900000 15669000 24998000 13123000 13520000 37843000 44259000 17712000 ...
# $ Pop_Urban: num 23416000 25400000 21009000 18305671 21753486 ...

Error in prettyDate range too small for min.n in ggplot2 facet chart

I have a dataset with all the olympics results in athletics.
I need to make a faceted ggplot with different categories, for example 100m and Marathon, so I subset:
ath.sub <- subset(ath, Event_STD%in%c('100m','Marathon'))
I got this dataframe:
> head(ath.sub)
Event Event_STD Athlete Country Result Medal YEAR unit Sex time
1261 100m Men 100m Usain Bolt JAM 9.69 GOLD 2008 time Men 2011-01-01
1262 100m Men 100m Donovan Bailey CAN 9.84 GOLD 1996 time Men 2011-01-01
1263 100m Men 100m Justin Gatlin USA 9.85 GOLD 2004 time Men 2011-01-01
1264 100m Men 100m Francis Obikwelu POR 9.86 SILVER 2004 time Men 2011-01-01
1265 100m Men 100m Maurice Greene USA 9.87 GOLD 2000 time Men 2011-01-01
1266 100m Men 100m Maurice Greene USA 9.87 BRONZE 2004 time Men 2011-01-01
> tail(ath.sub)
Event Event_STD Athlete Country Result Medal YEAR unit Sex time
3370 Marathon Women Marathon Valentina Yegorova RUS 2:28.05 SILVER 1996 time Women 2011-01-01 02:28:00
3371 Marathon Women Marathon Yuko Arimori JPN 2:28.39 BRONZE 1996 time Women 2011-01-01 02:28:00
3372 Marathon Women Marathon Valentina Yegorova URS 2:32:41 GOLD 1992 time Women 2011-01-01 02:32:00
3373 Marathon Women Marathon Yuko Arimori JPN 2:32:49 SILVER 1992 time Women 2011-01-01 02:32:00
3374 Marathon Women Marathon Lorraine Moller NZL 2:33.59 BRONZE 1992 time Women 2011-01-01 02:33:00
3375 Marathon Women Marathon Catherine Ndereba KEN <NA> SILVER 2008 time Women <NA>
> str(ath.sub)
'data.frame': 236 obs. of 10 variables:
$ Event : chr "100m Men" "100m Men" "100m Men" "100m Men" ...
$ Event_STD: chr "100m" "100m" "100m" "100m" ...
$ Athlete : chr "Usain Bolt" "Donovan Bailey" "Justin Gatlin" "Francis Obikwelu" ...
$ Country : chr "JAM" "CAN" "USA" "POR" ...
$ Result : chr "9.69" "9.84" "9.85" "9.86" ...
$ Medal : chr "GOLD" "GOLD" "GOLD" "SILVER" ...
$ YEAR : int 2008 1996 2004 2004 2000 2004 1996 2008 1996 2008 ...
$ unit : chr "time" "time" "time" "time" ...
$ Sex : chr "Men" "Men" "Men" "Men" ...
$ time : chr "2011-01-01 00:00:09.69" "2011-01-01 00:00:09.84" "2011-01-01 00:00:09.85" "2011-01-01 00:00:09.86" ...
Then I convert the time field to POSIXct:
> ath.sub$time<-as.POSIXct(ath.sub$time,tz = 'GMT')
> str(ath.sub$time)
POSIXct[1:236], format: "2011-01-01 00:00:00" "2011-01-01 00:00:00" "2011-01-01 00:00:00" "2011-01-01 00:00:00" ...
As I wrote before, I need to make a faceted ggplot line chart.
If I choose similar disciplines (like 100m or 400m) I have no problems.
But with disciplines on different time scales, like 100m and marathon, I get this error:
Error in prettyDate(x = x, n = n, min.n = min.n, sep = sep, ...) :
range too small for 'min.n'
Here is the ggplot code:
gg.ath<- ggplot(ath.sub, aes( YEAR, time, colour=Sex))+
facet_wrap(~Event_STD, scales = 'free')+
scale_y_datetime()+
scale_x_continuous(breaks = ath.sub$YEAR)+
geom_line()+
geom_smooth()
My colleague fixed it by using the lubridate package to convert the time field:
ath.sub$time <- lubridate::ymd_hms(ath.sub$time)
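A possible explanation, not confirmed in the post: the str() output after as.POSIXct shows every value as 00:00:00, which suggests the time of day was dropped during conversion. If so, all 100m results would be identical and that facet's y range would be zero, which is consistent with the "range too small" error. Besides lubridate, passing an explicit format to as.POSIXct is one base R way to keep the time component. A minimal sketch with made-up values:

```r
# Made-up times in the same layout as the `time` column
times_chr <- c("2011-01-01 00:00:09.69",
               "2011-01-01 00:00:09.84",
               "2011-01-01 02:28:05")

# %OS reads seconds including any fractional part
fmt   <- "%Y-%m-%d %H:%M:%OS"
times <- as.POSIXct(times_chr, format = fmt, tz = "GMT")

# Seconds elapsed since midnight, to confirm the time of day survived
midnight <- as.POSIXct("2011-01-01 00:00:00", format = fmt, tz = "GMT")
secs <- as.numeric(difftime(times, midnight, units = "secs"))
secs
```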

r: dplyr mutate error non-numeric argument to binary operator

I'm trying mutate() in dplyr on a data.frame (named list) but get an error: non-numeric argument to binary operator. I tried converting delayed and 'on time' to numeric but I'm still getting the error. Is there an error in the code?
list$delayed <- as.numeric(as.character(list$delayed))
list$'on time' <- as.numeric(as.character(list$'on time'))
list <- mutate(list, total = delayed + 'on tine', pctdelay = delayed / total * 100)
Carrier City delayed on time
1 Alaska Los Angeles 62 497
2 Alaska Phoenix 12 221
3 Alaska San Diego 20 212
4 Alaska San Francisco 102 503
5 Alaska Seattle 305 1841
6 AM WEST Los Angeles 117 694
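A likely cause, offered as a sketch rather than a confirmed diagnosis: in R, single or double quotes create a character string, so delayed + 'on tine' tries to add a number to text (the string is also misspelled). A column name containing a space has to be wrapped in backticks to be read as a variable. The example below uses base R with() on the first three rows of the table above so that it runs without extra packages; the same backtick syntax applies inside mutate(). The data frame name flights is hypothetical.

```r
# First three rows of the table; check.names = FALSE keeps the space in "on time"
flights <- data.frame(
  Carrier   = c("Alaska", "Alaska", "Alaska"),
  City      = c("Los Angeles", "Phoenix", "San Diego"),
  delayed   = c(62, 12, 20),
  `on time` = c(497, 221, 212),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Quotes make a string, so this reproduces the error from the question
err <- tryCatch(with(flights, delayed + 'on time'),
                error = function(e) conditionMessage(e))
err

# Backticks refer to the column itself
flights$total    <- with(flights, delayed + `on time`)
flights$pctdelay <- flights$delayed / flights$total * 100
flights
```

Separately, naming a data frame list shadows the base function of the same name inside that scope and is easy to misread, so a different name is usually safer.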

stuck making a data frame after using street2coordinates (R)

I am trying to follow the tutorial outlined here, but I am running into a problem at this step:
my_crime <- data.frame(year=my_crime$Year, community=my_crime$Community.Area,
type=my_crime$Primary.Type, arrest=my_crime$Arrest,
latitude=my_crime$Latitude, longitude=my_crime$Longitude)
My equivalent step is:
geocode <- data.frame(latitude=geocode$lat, longitude=geocode$long)
I get the following error:
Error in geocode$lat : $ operator is invalid for atomic vectors
I made the geocode dataset by sending a list of addresses to the street2coordinates website and getting back a list of long/lats (as outlined here). It seems that something is wrong with the dataset I created from that. Here is the part where I make geocode:
data2 <- paste0("[",paste(paste0("\"",fm$V2,"\""),collapse=","),"]")
data2
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data2)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,lapply(json,
function(x) c(address=paste(x$street_address, x$locality, x$region), long=x$longitude,lat=x$latitude)))
geocode
Thank you for any and all help!
Results of str(geocode) after the first do.call (I altered the addresses):
chr [1:2, 1:3] "123 Main St Anytown MA" "669 Main St Anytown MA" "-65.5" "-33.4" "22.1" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] " 123 Main St Anytown MA" " 669 Main St Anytown MA"
..$ : chr [1:3] "address" "long" "lat"
Or you can use the RDSTK package and do the same thing:
library(RDSTK)
data <- c("1208 Buckingham Drive, Crystal Lake, IL 60014",
"9820 State Street East, Paducah, KY 42001",
"685 Park Place, Saint Petersburg, FL 33702",
"5316 4th Avenue, Charlotte, NC 28205",
"2994 Somerset Drive, Baldwinsville, NY 13027",
"5457 5th Street South, Tallahassee, FL 32303")
geocode <- do.call(rbind, lapply(data, street2coordinates))
geocode
## full.address country_code3 latitude
## 1 1208 Buckingham Drive, Crystal Lake, IL 60014 USA 42.21893
## 2 9820 State Street East, Paducah, KY 42001 USA 36.50045
## 3 685 Park Place, Saint Petersburg, FL 33702 USA 27.96470
## 4 5316 4th Avenue, Charlotte, NC 28205 USA 35.22241
## 5 2994 Somerset Drive, Baldwinsville, NY 13027 USA 42.94575
## 6 5457 5th Street South, Tallahassee, FL 32303 USA 30.45489
## country_name longitude street_address region confidence
## 1 United States -88.33914 474 Buckingham Dr IL 0.805
## 2 United States -88.32971 498 State St KY 0.551
## 3 United States -82.79733 685 Park St FL 0.721
## 4 United States -80.80540 1698 Firth Ct NC 0.512
## 5 United States -76.56455 98 Somerset Ave NY 0.537
## 6 United States -84.29354 699 W 5th Ave FL 0.610
## street_number locality street_name fips_county country_code
## 1 474 Crystal Lake Buckingham Dr 17111 US
## 2 498 Hazel State St 21035 US
## 3 685 Clearwater Park St 12103 US
## 4 1698 Charlotte Firth Ct 37119 US
## 5 98 Auburn Somerset Ave 36011 US
## 6 699 Tallahassee W 5th Ave 12073 US
Currently, your do.call creates a matrix (using rbind and c), coercing all the numeric into characters.
The following should turn your list "json" into the data.frame "geocode" with the information you need, i.e. "address", "long" and "lat". lapply keeps each result as a data frame so that rbind can stack them.
foo <- function(x) data.frame(address=paste(x$street_address, x$locality,
x$region), long=x$longitude,lat=x$latitude)
geocode <- do.call(rbind, lapply(json, foo))
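The difference between the two approaches can be seen on a small stand-in for the parsed JSON. The list below is hypothetical (its field names follow the question's code, its values are partly made up). Binding c(...) vectors yields a character matrix, on which $ fails, while binding one-row data frames keeps the coordinates numeric:

```r
# Hypothetical stand-in for the parsed street2coordinates response
json <- list(
  list(street_address = "123 Main St", locality = "Anytown", region = "MA",
       longitude = -65.5, latitude = 22.1),
  list(street_address = "669 Main St", locality = "Anytown", region = "MA",
       longitude = -33.4, latitude = 10.2)
)

# c() mixes text and numbers, so everything is coerced to character
as_matrix <- do.call(rbind, lapply(json, function(x)
  c(address = paste(x$street_address, x$locality, x$region),
    long = x$longitude, lat = x$latitude)))

# `$` on a matrix reproduces the error from the question
err <- tryCatch(as_matrix$lat, error = function(e) conditionMessage(e))
err

# One-row data frames keep each column's own type
to_row <- function(x) data.frame(
  address = paste(x$street_address, x$locality, x$region),
  long = x$longitude, lat = x$latitude,
  stringsAsFactors = FALSE
)
geocode <- do.call(rbind, lapply(json, to_row))
str(geocode)
```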
