R: dplyr mutate error: non-numeric argument to binary operator

I'm trying mutate() from dplyr on the data frame below (stored in a variable called list), but I get the error: non-numeric argument to binary operator. I tried converting delayed and on time to numeric, but I still get the error. Is there an error in the code?
list$delayed <- as.numeric(as.character(list$delayed))
list$'on time' <- as.numeric(as.character(list$'on time'))
list <- mutate(list, total = delayed + 'on tine', pctdelay = delayed / total * 100)
Carrier City delayed on time
1 Alaska Los Angeles 62 497
2 Alaska Phoenix 12 221
3 Alaska San Diego 20 212
4 Alaska San Francisco 102 503
5 Alaska Seattle 305 1841
6 AM WEST Los Angeles 117 694
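For what it's worth, the error almost certainly comes from the quoted column name: inside mutate(), 'on tine' is a character string (and a misspelling of 'on time'), so delayed + 'on tine' tries to add a number to a string. Non-syntactic column names are referenced with backticks in dplyr. A minimal sketch of the corrected call, keeping the variable name list from the question (shadowing base::list is best avoided, though):
library(dplyr)
# Backticks reference the column whose name contains a space;
# quotes would create a character literal and reproduce the error.
list <- mutate(list,
               total = delayed + `on time`,
               pctdelay = delayed / total * 100)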

Related

How to filter a dataframe so that it finds the maximum value for 10 unique occurrences of another variable

I have a dataframe that I filter down to include only counties in the state of Washington and only the columns relevant to my question. What I want is to filter it down to just 10 rows: the rows with the highest black prison population out of all the counties in Washington State, regardless of year. The part I am struggling with is that counties can't repeat, so each row should hold the highest black prison population for one of the top 10 unique county names in the state. Some of the counties also have null data for the black prison population. You should be able to reproduce the updated dataframe with the following:
library(dplyr)
incarceration <- read.csv("https://raw.githubusercontent.com/vera-institute/incarceration-trends/master/incarceration_trends.csv")
blackPrisPop <- incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA")
Sample of what the updated dataframe looks like (should include 1911 rows):
fips county_name state year black_pop_15to64 black_prison_pop
130 53005 Benton County WA 2001 1008 25
131 53005 Benton County WA 2002 1143 20
132 53005 Benton County WA 2003 1208 21
133 53005 Benton County WA 2004 1236 27
134 53005 Benton County WA 2005 1310 32
135 53005 Benton County WA 2006 1333 35
You can group_by county_name and then use slice_max to take the row with the maximum value of black_prison_pop in each group. With n = 1 you get one row per county, and with with_ties = FALSE you get a single row even in the case of ties.
You can then arrange the result in descending order of black_prison_pop to get the overall top 10 values across all counties.
library(dplyr)
incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA") %>%
  group_by(county_name) %>%
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
  arrange(desc(black_prison_pop)) %>%
  head(10)
Output
black_prison_pop black_pop_15to64 year fips county_name state
<dbl> <dbl> <int> <int> <chr> <chr>
1 1845 73480 2002 53033 King County WA
2 975 47309 2013 53053 Pierce County WA
3 224 5890 2005 53063 Spokane County WA
4 172 19630 2015 53061 Snohomish County WA
5 137 8129 2016 53011 Clark County WA
6 129 5146 2003 53035 Kitsap County WA
7 102 5663 2009 53067 Thurston County WA
8 58 706 1991 53021 Franklin County WA
9 50 1091 1991 53077 Yakima County WA
10 46 1748 2008 53073 Whatcom County WA
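Since the question mentions missing values, it may also help to drop the NA populations explicitly before slicing. A minimal sketch with the same columns as above:
incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA", !is.na(black_prison_pop)) %>%  # drop county-years with no data
  group_by(county_name) %>%
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
  arrange(desc(black_prison_pop)) %>%
  head(10)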

Showing multiple columns in aggregate function including strings/characters in R

R noob question here.
Let's say I have this data frame:
City State Pop
Fresno CA 494
San Franciso CA 805
San Jose CA 945
San Diego CA 1307
Los Angeles CA 3792
Reno NV 225
Henderson NV 257
Las Vegas NV 583
Gresham OR 105
Salem OR 154
Eugene OR 156
Portland OR 583
Fort Worth TX 741
Austin TX 790
Dallas TX 1197
San Antonio TX 1327
Houston TX 2100
I want to get, let's say, the 3rd lowest population per State, which would give:
City State Pop
San Jose CA 945
Las Vegas NV 583
Eugene OR 156
Dallas TX 1197
I tried this one:
ord_pop_state <- aggregate(Pop ~ State, data = ord_pop, function(x) { x[3] })
And I get this one:
State Pop
CA 945
NV 583
OR 156
TX 1197
What am I missing here in order to get the desired output that includes the City?
I would suggest trying the data.table package for such a task, as the syntax is easier and the code is more efficient. I would also suggest adding the order function to make sure the data is sorted:
library(data.table)
setDT(ord_pop)[order(Pop), .SD[3L], keyby = State]
# State City Pop
# 1: CA San Jose 945
# 2: NV Las Vegas 583
# 3: OR Eugene 156
# 4: TX Dallas 1197
So basically, the data is first ordered by Pop, and then we subset .SD (which is the notation for the Subset of the Data in each group) by State, taking the third row per group.
This is easily solvable with base R too (we will assume that the data is already sorted here): we can just create an index per group and then do a simple subset by that index
ord_pop$indx <- with(ord_pop, ave(Pop, State, FUN = seq_along))  # row index within each State
ord_pop[ord_pop$indx == 3L, ]
# City State Pop indx
# 3 San Jose CA 945 3
# 8 Las Vegas NV 583 3
# 11 Eugene OR 156 3
# 15 Dallas TX 1197 3
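If the data were not already sorted, a sketch that sorts first and then builds the same per-group index:
ord_pop <- ord_pop[order(ord_pop$State, ord_pop$Pop), ]  # sort by State, then Pop
ord_pop$indx <- with(ord_pop, ave(Pop, State, FUN = seq_along))
ord_pop[ord_pop$indx == 3L, ]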
Here is a dplyr version:
df2 <- df %>%
  group_by(state) %>% # Group observations by state
  arrange(pop) %>%    # Within those groups, sort in ascending order by pop
  slice(3)            # Extract the third row in each arranged group
Here's the toy data I used to test it:
set.seed(1)
df <- data.frame(state = rep(LETTERS[1:3], each = 5),
                 city = rep(letters[1:5], 3),
                 pop = round(rnorm(15, 1000, 100), digits = 0))
And here's the output from that; it's a coincidence that 'b' was third-smallest in each case, not a glitch in the code:
> df2
Source: local data frame [3 x 3]
Groups: state
state city pop
1 A b 1018
2 B b 1049
3 C b 1039
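On current dplyr (1.0 or later) the same idea can be written with slice_min instead of arrange plus slice; a sketch, assuming distinct pop values within each state:
df %>%
  group_by(state) %>%
  slice_min(pop, n = 3, with_ties = FALSE) %>%  # the three smallest pops per state
  slice_tail(n = 1) %>%                         # keep only the third smallest
  ungroup()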
In R the same end result can be achieved using different packages; the choice of package is a trade-off between efficiency and simplicity of code.
Since you come from a strong SQL background, this might be easier to use:
library(sqldf)
#Example: return the 3rd lowest population overall
result <- sqldf('select City, State, Pop from ord_pop order by Pop limit 1 offset 2;')
#Note: the SQL query is a sample and needs to be modified to get the desired per-state result.
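For the per-state version, one portable option is a correlated subquery that counts how many rows in the same state have a smaller Pop; a sketch, assuming Pop values are distinct within a state:
library(sqldf)
# the 3rd lowest Pop per State: exactly two rows in the same state are smaller
result <- sqldf("select City, State, Pop
                 from ord_pop o
                 where (select count(*) from ord_pop i
                        where i.State = o.State and i.Pop < o.Pop) = 2")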

stuck making a data frame after using street2coordinates (R)

I am trying to follow the tutorial outlined here, but I am running into a problem at this step:
my_crime <- data.frame(year = my_crime$Year, community = my_crime$Community.Area,
                       type = my_crime$Primary.Type, arrest = my_crime$Arrest,
                       latitude = my_crime$Latitude, longitude = my_crime$Longitude)
My equivalent step is:
geocode <- data.frame(latitude=geocode$lat, longitude=geocode$long)
I get the following error:
Error in geocode$lat : $ operator is invalid for atomic vectors
I made the geocode dataset by sending a list of addresses to the street2coordinates website and getting back a list of longitudes/latitudes (as outlined here). It seems that something is wrong with the dataset I created from that. Here is the part where I make geocode:
data2 <- paste0("[", paste(paste0("\"", fm$V2, "\""), collapse = ","), "]")
data2
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url, body = data2)
json <- fromJSON(content(response, type = "text"))
geocode <- do.call(rbind, lapply(json, function(x)
  c(address = paste(x$street_address, x$locality, x$region),
    long = x$longitude, lat = x$latitude)))
geocode
Thank you for any and all help!
Results of str(geocode) after the first do.call (I altered the addresses):
chr [1:2, 1:3] "123 Main St Anytown MA" "669 Main St Anytown MA" "-65.5" "-33.4" "22.1" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] " 123 Main St Anytown MA" " 669 Main St Anytown MA"
..$ : chr [1:3] "address" "long" "lat"
Or you can use the RDSTK package and do the same thing:
library(RDSTK)
data <- c("1208 Buckingham Drive, Crystal Lake, IL 60014",
          "9820 State Street East, Paducah, KY 42001",
          "685 Park Place, Saint Petersburg, FL 33702",
          "5316 4th Avenue, Charlotte, NC 28205",
          "2994 Somerset Drive, Baldwinsville, NY 13027",
          "5457 5th Street South, Tallahassee, FL 32303")
geocode <- do.call(rbind, lapply(data, street2coordinates))
geocode
## full.address country_code3 latitude
## 1 1208 Buckingham Drive, Crystal Lake, IL 60014 USA 42.21893
## 2 9820 State Street East, Paducah, KY 42001 USA 36.50045
## 3 685 Park Place, Saint Petersburg, FL 33702 USA 27.96470
## 4 5316 4th Avenue, Charlotte, NC 28205 USA 35.22241
## 5 2994 Somerset Drive, Baldwinsville, NY 13027 USA 42.94575
## 6 5457 5th Street South, Tallahassee, FL 32303 USA 30.45489
## country_name longitude street_address region confidence
## 1 United States -88.33914 474 Buckingham Dr IL 0.805
## 2 United States -88.32971 498 State St KY 0.551
## 3 United States -82.79733 685 Park St FL 0.721
## 4 United States -80.80540 1698 Firth Ct NC 0.512
## 5 United States -76.56455 98 Somerset Ave NY 0.537
## 6 United States -84.29354 699 W 5th Ave FL 0.610
## street_number locality street_name fips_county country_code
## 1 474 Crystal Lake Buckingham Dr 17111 US
## 2 498 Hazel State St 21035 US
## 3 685 Clearwater Park St 12103 US
## 4 1698 Charlotte Firth Ct 37119 US
## 5 98 Auburn Somerset Ave 36011 US
## 6 699 Tallahassee W 5th Ave 12073 US
Currently, your do.call creates a matrix (using rbind and c), coercing all the numeric values to character.
The following should turn your list json into the data.frame geocode with the information you need, i.e. address, long and lat.
foo <- function(x) data.frame(address = paste(x$street_address, x$locality, x$region),
                              long = x$longitude, lat = x$latitude)
geocode <- do.call(rbind, lapply(json, foo))  # lapply keeps the per-address data frames intact for rbind
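Alternatively, if you keep the character matrix from the first do.call (as shown in the str() output above), a quick sketch to coerce it back into a usable data frame:
geocode_df <- as.data.frame(geocode, stringsAsFactors = FALSE)
geocode_df$long <- as.numeric(geocode_df$long)  # back from character to numeric
geocode_df$lat <- as.numeric(geocode_df$lat)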

Maps, ggplot2, fill by state is missing certain areas on the map

I am working with maps and ggplot2 to visualize the number of certain crimes in each state for different years. The data set I am working with was produced by the FBI and can be downloaded from their site or from here. (If you don't want to download the dataset, I don't blame you; it is far too big to paste into this question, and a fraction of it wouldn't contain enough information to recreate the graph.)
The problem is easier seen than described.
As you can see, California is missing a large chunk, as are a few other states. Here is the code that produced this plot:
# load libraries
library(maps)
library(ggplot2)
# load data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
states <- map_data("state")
# merge data sets by region
fbi$region <- tolower(fbi$state)
fbimap <- merge(fbi, states, by="region")
# plot robbery numbers by state for year 2012
fbimap12 <- subset(fbimap, Year == 2012)
qplot(long, lat, geom = "polygon", data = fbimap12,
      facets = ~Year, fill = Robbery, group = group)
This is what the states data looks like:
long lat group order region subregion
1 -87.46201 30.38968 1 1 alabama <NA>
2 -87.48493 30.37249 1 2 alabama <NA>
3 -87.52503 30.37249 1 3 alabama <NA>
4 -87.53076 30.33239 1 4 alabama <NA>
5 -87.57087 30.32665 1 5 alabama <NA>
6 -87.58806 30.32665 1 6 alabama <NA>
And this is what the fbi data looks like:
Year Population Violent Property Murder Forcible.Rape Robbery
1 1960 3266740 6097 33823 406 281 898
2 1961 3302000 5564 32541 427 252 630
3 1962 3358000 5283 35829 316 218 754
4 1963 3347000 6115 38521 340 192 828
5 1964 3407000 7260 46290 316 397 992
6 1965 3462000 6916 48215 395 367 992
Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft abbr state region
1 4512 11626 19344 2853 AL Alabama alabama
2 4255 11205 18801 2535 AL Alabama alabama
3 3995 11722 21306 2801 AL Alabama alabama
4 4755 12614 22874 3033 AL Alabama alabama
5 5555 15898 26713 3679 AL Alabama alabama
6 5162 16398 28115 3702 AL Alabama alabama
I then merged the two sets along region. The subset I am trying to plot is
region Year Robbery long lat group
8283 alabama 2012 5020 -87.46201 30.38968 1
8284 alabama 2012 5020 -87.48493 30.37249 1
8285 alabama 2012 5020 -87.95475 30.24644 1
8286 alabama 2012 5020 -88.00632 30.24071 1
8287 alabama 2012 5020 -88.01778 30.25217 1
8288 alabama 2012 5020 -87.52503 30.37249 1
... ... ... ...
Any ideas on how I can create this plot without those ugly missing spots?
I played with your code. One thing I can tell is that something happened when you used merge. I drew the states map using geom_path and confirmed that there were a couple of stray lines which do not exist in the original map data. I then investigated further by comparing merge and inner_join, which do the same job here. There was one difference, though: when I used merge, the row order changed, so the polygon points were no longer in the right sequence; this was not the case with inner_join. You will see a bit of the California data below. Your approach was right, but merge somehow did not work in your favour. I am not sure why the function changed the order, though.
library(dplyr)
### Call US map polygon
states <- map_data("state")
### Get crime data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
fbi$state <- tolower(fbi$state)
### Check if both files have identical state names: The answer is NO
### states$region does not have Alaska, Hawaii, and Washington D.C.
### fbi$state does not have District of Columbia.
setdiff(fbi$state, states$region)
#[1] "alaska" "hawaii" "washington d. c."
setdiff(states$region, fbi$state)
#[1] "district of columbia"
### Select data for 2012 and choose two columns (i.e., state and Robbery)
fbi2 <- fbi %>%
  filter(Year == 2012) %>%
  select(state, Robbery)
Now I created two data frames with merge and inner_join.
### Create two data frames with merge and inner_join
ana <- merge(fbi2, states, by.x = "state", by.y = "region")
bob <- inner_join(fbi2, states, by = c("state" ="region"))
ana %>%
  filter(state == "california") %>%
  slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -119.8685 38.90956 4 676 <NA>
#2 california 56521 -119.5706 38.69757 4 677 <NA>
#3 california 56521 -119.3299 38.53141 4 678 <NA>
#4 california 56521 -120.0060 42.00927 4 667 <NA>
#5 california 56521 -120.0060 41.20139 4 668 <NA>
bob %>%
  filter(state == "california") %>%
  slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -120.0060 42.00927 4 667 <NA>
#2 california 56521 -120.0060 41.20139 4 668 <NA>
#3 california 56521 -120.0060 39.70024 4 669 <NA>
#4 california 56521 -119.9946 39.44241 4 670 <NA>
#5 california 56521 -120.0060 39.31636 4 671 <NA>
ggplot(data = bob, aes(x = long, y = lat, fill = Robbery, group = group)) +
  geom_polygon()
The problem is in the order of the arguments to merge:
fbimap <- merge(fbi, states, by="region")
has the thematic data first and the geo data second. Switch the order:
fbimap <- merge(states, fbi, by="region")
and the polygons should all close up.
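Another common fix, sketched here as an alternative: keep the merge as it was, but re-sort the merged frame by the map data's order column, which records the sequence in which the polygon points must be drawn:
fbimap <- merge(fbi, states, by = "region")
fbimap <- fbimap[order(fbimap$order), ]  # restore the point order of each polygon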

Calculate rows with same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R sum the last column, count, per city or per state so I can map the data. For example, I want to show how many participants (in count) are in the state of Hawaii (HI).
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
What I've tried is
match <- c(match(datazip$state, datazip$number))
but I'm really helpless in finding a solution, since I don't even know how to describe the problem in short. My plan afterwards is to make a choropleth map with the data, and believe me, by now I've seen almost all the pages that try to give advice, so your help is much appreciated. Thanks
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
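If you want one summary row per state instead of keeping every original row, summarise (rather than transform) collapses the groups; a sketch:
library(plyr)
ddply(df, .(state), summarise, count2 = sum(count))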
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count,by=list(df$state),sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count,by=list(df$city),sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36
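For completeness, a dplyr sketch of the same aggregation, assuming the data frame is called df as above:
library(dplyr)
df %>%
  group_by(state) %>%
  summarise(total = sum(count))  # total participants per state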
