Pretty simple problem, I think, but I'm not sure of the proper solution. I've done some research on this and believe I've seen a solution somewhere, but can't remember where... anyway:
I want to get DP03, one-year ACS data for all Ohio counties for 2019. However, the code below only returns 39 of Ohio's 88 counties. How can I access the remaining counties?
My guess is that data is only being pulled for counties with populations greater than 65,000.
library(tidycensus)
library(tidyverse)
acs_2019 <- load_variables(2019, dataset = "acs1/profile")
DP03 <- acs_2019 %>%
  filter(str_detect(name, pattern = "^DP03")) %>%
  pull(name, label)
Ohio_county <-
  get_acs(geography = "county",
          year = 2019,
          state = "OH",
          survey = "acs1",
          variables = DP03,
          output = "wide")
This results in a table that looks like this...
Ohio_county
# A tibble: 39 x 550
GEOID NAME `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 39057 Gree~ 138295 815 138295 NA 87465
2 39043 Erie~ 61316 516 61316 NA 38013
3 39153 Summ~ 442279 1273 442279 NA 286777
4 39029 Colu~ 83317 634 83317 NA 48375
5 39099 Maho~ 188298 687 188298 NA 113806
6 39145 Scio~ 60956 588 60956 NA 29928
7 39003 Alle~ 81560 377 81560 NA 49316
8 39023 Clar~ 108730 549 108730 NA 64874
9 39093 Lora~ 250606 896 250606 NA 150136
10 39113 Mont~ 428140 954 428140 NA 267189
I'm pretty sure I've seen a solution to this somewhere, but I can't recall where.
Any help would be appreciated, since it would let the office pull census data much more easily rather than wading through the US Census Bureau site. Thank you!
My colleague, who had already pulled the data, did not specify whether the DP03 data came from the ACS 1-year or the ACS 5-year survey. As it turns out, it was from the ACS 5-year survey, which includes all Ohio counties, not just those with populations over 65,000. Follow the comments above for a description of how this answer was determined.
Code for all counties is here:
library(tidycensus)
library(tidyverse)
acs_2018 <- load_variables(2018, dataset = "acs5/profile")
DP03 <- acs_2018 %>%
  filter(str_detect(name, pattern = "^DP03")) %>%
  pull(name)
Ohio_county <-
  get_acs(geography = "county",
          year = 2018,
          state = "OH",
          survey = "acs5",
          variables = DP03,
          output = "wide")
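As a quick sanity check (assuming the code above ran without errors), you can confirm that all 88 counties came back:
nrow(Ohio_county)  # should report 88 rows, one per Ohio county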
I understand the problem and have shown all my work. I'm working through the ModernDive data science book (https://moderndive.com/3-wrangling.html#joins) and got stuck on (LC3.20) at the end of chapter 3. Using the nycflights13 package in R and dplyr, I'm supposed to generate a tibble that has only two columns: airline name and seat miles. Seat miles is just seats * miles. I thought my code would produce the correct result, but my seat miles differ from the solution for every airline carrier. Can someone please help me figure out where my code went wrong? I do understand the book's solution; I just don't know why mine is wrong. I've posted all my work below.
# seat miles = miles * seats
View(flights)  # distance; identifiers: year, tailnum and carrier
View(airlines) # names; identifier: carrier
View(planes)   # seats; identifiers: year and tailnum
# join names to flights
named_flights <- flights %>%
  inner_join(airlines, by = 'carrier')
named_flights # same number of rows, all good
flights
# join seats to named_flights
named_seat_flights <- named_flights %>%
  inner_join(planes, by = c('tailnum'))
named_seat_flights # noticed 52,596 rows are missing
# when joining tailnum to named_flights
table(is.na(select(named_flights, 'tailnum')))
# 2512 rows have NA values for tailnum in named_flights
table(is.na(select(planes, 'tailnum')))
# no tailnum data is missing from the planes dataset,
# and since a given plane (with a given tailnum)
# can take off multiple times per year,
# we can conclude that the 52,596 missing rows
# are a result of the missing tailnum data in flights (and named_flights)
named_seat_miles_by_airline_name <- named_seat_flights %>%
  group_by(name) %>%
  summarise(seat_miles = sum(seats, na.rm = T) * sum(distance, na.rm = T)) %>%
  rename(airline_name = name) %>%
  arrange(desc(seat_miles))
named_seat_miles_by_airline_name # not correct
View(named_seat_miles_by_airline_name)
flights %>% # book solution
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  mutate(ASM = seats * distance) %>%
  group_by(carrier) %>%
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>%
  arrange(desc(ASM))
The output of my code is
# A tibble: 16 x 2
airline_name seat_miles
<chr> <dbl>
1 United Air Lines Inc. 8.73e14
2 Delta Air Lines Inc. 4.82e14
3 JetBlue Airways 4.13e14
4 ExpressJet Airlines I~ 9.82e13
5 US Airways Inc. 3.83e13
6 American Airlines Inc. 3.38e13
7 Southwest Airlines Co. 2.10e13
8 Endeavor Air Inc. 1.28e13
9 Virgin America 1.19e13
10 AirTran Airways Corpo~ 6.68e11
11 Alaska Airlines Inc. 2.24e11
12 Hawaiian Airlines Inc. 2.20e11
13 Frontier Airlines Inc. 1.17e11
14 Mesa Airlines Inc. 1.17e10
15 Envoy Air 7.10e 9
16 SkyWest Airlines Inc. 4.08e 7
The output of the book's code is
# A tibble: 16 x 2
carrier ASM
<chr> <dbl>
1 UA 15516377526
2 DL 10532885801
3 B6 9618222135
4 AA 3677292231
5 US 2533505829
6 VX 2296680778
7 EV 1817236275
8 WN 1718116857
9 9E 776970310
10 HA 642478122
11 AS 314104736
12 FL 219628520
13 F9 184832280
14 YV 20163632
15 MQ 7162420
16 OO 1299835
Also, I know I have airline names instead of carrier codes, but that's actually what was asked.
Your code computes the product of the sums instead of the sum of the products.
Compare these:
...
  filter(!is.na(seats)) %>%
  summarise(seat_miles_sums = sum(seats, na.rm = T) * sum(distance, na.rm = T),
            seat_miles = sum(seats * distance))
...
Graphically, the question is asking for something like the areas on the left below, but your code calculates the area on the right:
XXX   YY     XXXYY
XXX + YY  <  XXXYY
      YY     XXXYY
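For completeness, here is a minimal sketch of a corrected version of your pipeline, keeping the airline names the exercise asks for (this is my own rewrite, not the book's solution); the only substantive change is summing seats * distance per flight instead of multiplying two column totals:
library(nycflights13)
library(dplyr)
flights %>%
  inner_join(airlines, by = "carrier") %>%
  inner_join(planes, by = "tailnum") %>%
  group_by(airline_name = name) %>%
  summarise(seat_miles = sum(seats * distance, na.rm = TRUE)) %>%
  arrange(desc(seat_miles))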
I have a set of survey design data for each quarter/year saved as .rds files on my disk. The data looks like this:
Year Quarter Age
2010 1 27
2010 1 32
2010 1 34
...
I'm using svymean(formula = ~Age, na.rm = T, design = data20101) to estimate the mean of the Age variable for each year/quarter file. I would like to run this more efficiently: loop over the files, apply the function to each, and collect the results in one single data frame.
The output I'm looking for is a data frame like this:
Year Quarter Mean_Age
2010 1 31.1
2010 1 32.4
2010 1 30.9
2010 1 34.5
2010 2 36.3
2010 2 31.2
2010 2 30.8
2010 2 35.6
...
I don't have enough rep to comment. I see r2evans is making good suggestions about how you might read in your big data. You will need to list the files in some way if you are to iterate through them. The approach below iterates through the list of file names, assuming your data files sit alone in one directory. It also never keeps more than one dataset in memory at a time, which is ideal if the only thing you want is the grouped mean ages (not ideal if you are running more analysis besides this). I'm not sure what was most pressing in your question, but below is a general model of how to approach the problem.
library(dplyr)
output <- data.frame(Year = numeric(),
                     Quarter = numeric(),
                     Mean_Age = numeric())
filepath <- "./filepath_to_data/"
files_list <- list.files(filepath)
for (i in seq_along(files_list)) {
  output <- read.csv(paste0(filepath, files_list[i])) %>%
    group_by(Year, Quarter) %>%
    summarise(Mean_Age = mean(Age), .groups = "drop") %>%
    bind_rows(output)
}
output
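Since the question mentions .rds files holding survey-design objects and svymean(), here is a hedged sketch of the same idea adapted to that setup. It assumes each .rds file contains a single svydesign object whose underlying data include Year and Quarter columns; the directory path and file pattern are illustrative.
library(survey)
filepath <- "./filepath_to_data"
files_list <- list.files(filepath, pattern = "\\.rds$")
results <- lapply(files_list, function(f) {
  des <- readRDS(file.path(filepath, f))            # one svydesign object per file
  est <- svymean(~Age, design = des, na.rm = TRUE)  # survey-weighted mean age
  data.frame(Year = des$variables$Year[1],          # assumes Year/Quarter are constant
             Quarter = des$variables$Quarter[1],    # within each file
             Mean_Age = coef(est)[["Age"]])
})
output <- do.call(rbind, results)
output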
Using NASA's API to retrieve information on Mars' weather, I get back a list of lists of lists.
In Python, pandas does a beautiful job of formatting the data; however, I have to code the API call in R.
This is how the data is structured in R, even after using jsonlite's fromJSON(x, flatten = TRUE).
I'd like to structure the raw data to be like the pandas table.
Here is my API code:
library(httr)
library(jsonlite)
req <- "https://api.nasa.gov/insight_weather/?api_key=&feedtype=json&ver=1.0"
response <- GET(req)
response <- content(response, as="text")
mars <- fromJSON(response, flatten = TRUE)
There's more information being returned from the API query than the table screenshot shows, but I've focused on returning a table with a structure similar to the example. If you want the extra information like wind direction, that has a different structure and might be easier to parse separately and merge.
library(jsonlite)
library(purrr)
library(dplyr)
library(tidyr)
req <- "https://api.nasa.gov/insight_weather/?api_key=DEMO_KEY&feedtype=json&ver=1.0"
mars <- fromJSON(req)
# mars[1:7] are the seven sols (days) of data; unlist() flattens the first six
# fields of each sol (the nested wind-direction data is handled separately, as noted above)
map(mars[1:7], ~unlist(.x[1:6]) %>%
      bind_rows) %>%
  bind_rows(.id = "day") %>%
  # names like "AT.av" are split into a value column ("AT") and a statistic column ("var")
  pivot_longer(cols = grep("\\.", names(.)),
               names_sep = "\\.",
               names_to = c(".value", "var"))
# A tibble: 28 x 8
day First_UTC Last_UTC Season var AT HWS PRE
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer av -65.475 5.364 637.752
2 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer ct 178174 79226 89083
3 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer mn -100.044 0.236 618.015
4 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer mx -16.815 21.146 653.7326
5 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer av -62.449 5.683 636.87
6 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer ct 211897 95539 105800
7 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer mn -101.272 0.205 618.1757
8 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer mx -16.931 20.986 653.4973
9 404 2020-01-15T07:44:09Z 2020-01-16T08:23:44Z summer av -63.622 5.303 636.148
10 404 2020-01-15T07:44:09Z 2020-01-16T08:23:44Z summer ct 293286 132690 158958
# … with 18 more rows
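One follow-up: because the values come out of unlist() as strings, the AT, HWS and PRE columns above are character. Assuming the pivoted tibble has been assigned to a name, say mars_tbl (my placeholder, not from the answer), a small addition converts them to numeric:
mars_tbl %>%
  mutate(across(c(AT, HWS, PRE), as.numeric))  # measurement columns to numeric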
I have a data frame containing a number of projects, their start dates, and their coordinates (long/lat), and I have a data frame containing a number of (fictional) respondents, the dates they were surveyed, and their coordinates:
respond_id<- c(1:5)
survey_year<- c(2007, 2005, 2008, 2004, 2005)
lat_1<- c(53.780928, 54.025200, 53.931432, 53.881048, 54.083359)
long_1<- c(9.614991, 9.349862, 9.473498, 10.685581, 10.026894)
project_id<- c(1111:1114)
year_start<- c(2007, 2007, 2006, 2008)
lat_2<- c(54.022881, 54.022881, 53.931753, 53.750523)
long_2<- c(9.381104, 9.381104, 9.505700, 9.666336)
survey<- data.frame(respond_id, survey_year, lat_1, long_1)
projects<- data.frame(project_id, year_start, lat_2, long_2)
Now, I want to create a new variable survey$projects_nearby that counts the number of projects located nearby (here: within 5 km of) each respondent. So the data frame survey should look somewhat like this (other results possible):
> survey
respond_id survey_year lat_1 long_1 projects_nearby
1 1 2007 53.780928 9.614991 0
2 2 2005 54.025200 9.349862 0
3 3 2008 53.931432 9.473498 1
4 4 2004 53.881048 10.685581 0
5 5 2005 54.083359 10.026894 0
Special attention needs to be paid to the start years of the projects and the years the surveys were conducted: if a respondent was surveyed in 2007 but a nearby project did not start until 2008, that project naturally does not count as a project nearby.
I thought of creating a distance matrix and then just counting the number of rows containing a distance smaller than 5 km... but I don't know how to create this distance matrix. And maybe a for loop would be easier?
Could anyone help me, or give me a hint on how to code this?
EDIT: I edited the expected values of survey$projects_nearby. Now these values should match with actual amount of projects located nearby the respective respondents.
I don't think the correct answer is the one shown? Below I left_join by year so that every row of survey is replicated for every matching project. Then I filter to rows where the Euclidean distance between the coordinate pairs is below 5 (note: this is in raw degrees, not km, so it is only a rough proximity check), count them per respondent, and join back to the original survey.
The results are slightly confusing too, as projects 1111 and 1112 from the same year are at the same location; this code counts them twice.
>survey
respond_id survey_year lat_1 long_1
1 1 2007 53.78093 9.614991
2 2 2005 54.02520 9.349862
3 3 2008 53.93143 9.473498
4 4 2004 53.88105 10.685581
5 5 2005 54.08336 10.026894
> projects
project_id year_start lat_2 long_2
1 1111 2007 54.02288 9.381104
2 1112 2007 54.02288 9.381104
3 1113 2006 53.93175 9.505700
4 1114 2008 53.75052 9.666336
> left_join(survey, projects, by = c( "survey_year"="year_start")) %>%
+ dplyr::filter( sqrt((lat_1-lat_2)^2 + (long_1-long_2)^2 ) < 5) %>%
+ group_by(respond_id, survey_year, lat_1, long_1) %>%
+ summarise(projects_nearby = n()) %>%
+ right_join(survey)
Joining, by = c("respond_id", "survey_year", "lat_1", "long_1")
Source: local data frame [5 x 5]
Groups: respond_id, survey_year, lat_1 [?]
respond_id survey_year lat_1 long_1 projects_nearby
<int> <dbl> <dbl> <dbl> <int>
1 1 2007 53.78093 9.614991 2
2 2 2005 54.02520 9.349862 NA
3 3 2008 53.93143 9.473498 1
4 4 2004 53.88105 10.685581 NA
5 5 2005 54.08336 10.026894 NA
You can, of course, change the NA values to zero if appropriate.
You can use the sp package to find the distances, and then just count the number that are nearby. That is,
library(sp)
# spDists() with longlat = TRUE expects coordinates as (longitude, latitude)
survey.loc <- as.matrix(survey[, c("long_1", "lat_1")])
project.loc <- as.matrix(projects[, c("long_2", "lat_2")])
distances <- spDists(survey.loc, project.loc, longlat = TRUE)  # great-circle distances in km
survey$projects_nearby <- apply(distances, 1, function(x) sum(x < 5))
I hope this helps!
EDIT:
My apologies for not considering the date.
library(sp)
# spDists() with longlat = TRUE expects coordinates as (longitude, latitude)
survey.loc <- as.matrix(survey[, c("long_1", "lat_1")])
project.loc <- as.matrix(projects[, c("long_2", "lat_2")])
distances <- spDists(survey.loc, project.loc, longlat = TRUE)  # great-circle distances in km
# rows are respondents, columns are projects; a project that started after the
# survey year gets Inf so it can never fall under the 5 km threshold
year.diff <- sapply(projects$year_start, function(x) survey$survey_year - x)
year.diff <- ifelse(year.diff < 0, Inf, 1)
survey$projects_nearby <- apply(year.diff * distances, 1, function(x) sum(x < 5))
I think you have to convert your lat/long coordinates to coordinates in a plane, or use the haversine distance described in this previous post:
haversine distance
https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula
Once you have the distances to a particular location in the projects data frame, you may need to find nearby points using knn or any other technique of your preference.
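For reference, here is a minimal sketch of the haversine great-circle distance in base R; the helper name and the 5 km / start-year filter below are my own illustration, not from the linked post:
# great-circle distance in km between points given in decimal degrees (vectorised)
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
# count, for each respondent, the projects within 5 km that had already started
survey$projects_nearby <- sapply(seq_len(nrow(survey)), function(i) {
  d <- haversine_km(survey$lat_1[i], survey$long_1[i], projects$lat_2, projects$long_2)
  sum(d < 5 & projects$year_start <= survey$survey_year[i])
})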
I need to reshape some data I downloaded from the World Bank database, but I'm having some difficulties with it.
The goal is for it to look like this:
year CH DE US
1980 17383.38 11746.40 12179.56
1981 15833.74 9879.46 13526.19
1982 16133.97 9593.66 13932.68
1983 16007.82 9545.86 15000.09
1984 15229.82 9012.48 16539.38
I use the following code to download the data; the WDI and RJSONIO packages are required.
wdi <- WDI(country = c("CH","DE","US"), indicator = "NY.GDP.PCAP.CD" ,start = 1980, end = 2010, extra = F)
Then I reshaped it the following way:
wdi2 <- reshape(wdi, direction = "wide", timevar="year", v.names="NY.GDP.PCAP.CD", idvar="country", drop="iso2c")
The output does not match my expectations of how it should look:
> wdi2
country NY.GDP.PCAP.CD.2010 NY.GDP.PCAP.CD.2009 NY.GDP.PCAP.CD.2008
1 Switzerland 70572.66 65790.07 68555.37
32 Germany 40163.82 40275.25 44132.04
63 United States 46615.51 45305.05 46759.56 ...
This one is a bit better but still not what I want:
> t(wdi2)
1 32 63
country "Switzerland" "Germany" "United States"
NY.GDP.PCAP.CD.2010 "70572.66" "40163.82" "46615.51"
NY.GDP.PCAP.CD.2009 "65790.07" "40275.25" "45305.05"
NY.GDP.PCAP.CD.2008 "68555.37" "44132.04" "46759.56"
NY.GDP.PCAP.CD.2007 "59663.77" "40402.99" "46349.12"
The wdi object looks like this:
> wdi
iso2c country NY.GDP.PCAP.CD year
1 CH Switzerland 70572.657 2010
2 CH Switzerland 65790.067 2009
3 CH Switzerland 68555.372 2008
4 CH Switzerland 59663.770 2007
...
30 CH Switzerland 16219.906 1981
31 CH Switzerland 17807.340 1980
32 DE Germany 40163.817 2010
33 DE Germany 40275.251 2009
34 DE Germany 44132.042 2008
...
62 DE Germany 11746.404 1980
63 US United States 46615.511 2010
64 US United States 45305.052 2009
In front of a computer again... so here's an update.
As mentioned in my comments, dcast from "reshape2" is quite convenient for this. You can get similar functionality from xtabs in base R if you're just doing the reshaping step.
x <- xtabs(NY.GDP.PCAP.CD ~ year + iso2c, wdi)
head(x)
# iso2c
# year CH DE US
# 1980 17807.34 11746.404 12179.56
# 1981 16219.91 9879.457 13526.19
# 1982 16527.46 9593.657 13932.68
# 1983 16398.24 9545.859 15000.09
# 1984 15601.26 9012.479 16539.38
# 1985 15748.95 9125.121 17588.81
xtabs creates a matrix of class "xtabs", so to get a data.frame, wrap the output in as.data.frame.matrix.
head(as.data.frame.matrix(x))
# CH DE US
# 1980 17807.34 11746.404 12179.56
# 1981 16219.91 9879.457 13526.19
# 1982 16527.46 9593.657 13932.68
# 1983 16398.24 9545.859 15000.09
# 1984 15601.26 9012.479 16539.38
# 1985 15748.95 9125.121 17588.81
To answer the other question from your comment ("However, isn't there a smarter way that puts the data directly into the right format by using only the reshape function?"): the answer is yes. Just swap what you were using for the "idvar" and "timevar" in your original reshape attempt.
y <- reshape(wdi[-2], direction = "wide", idvar="year", timevar="iso2c")
## Optional step to clean up the resulting names
names(y) <- gsub("NY.GDP.PCAP.CD.", "", names(y))
head(y)
# year CH DE US
# 1 2010 70572.66 40163.82 46615.51
# 2 2009 65790.07 40275.25 45305.05
# 3 2008 68555.37 44132.04 46759.56
# 4 2007 59663.77 40402.99 46349.12
# 5 2006 54140.50 35237.60 44622.64
# 6 2005 51734.30 33542.78 42516.39
When using the reshape function, sometimes it helps to ignore the "id" and "time" parts of the argument names and think instead about where they are to go. ID variables usually make up a column, and time variables usually spread out wide, one column for each time. So, even though we might think of "country" as the actual ID variable, for the data format that you want, it is more of a time variable.
Hopefully this helps, even though you've already accepted an answer :)
It is really easy to achieve using reshape2.
require(reshape2)
dcast(wdi[,-2], year ~ iso2c, value.var = 'NY.GDP.PCAP.CD')
EDIT: Oops, I did not see the comment posted by Ananda Mahto with the same solution. Ananda, if you post your comment as an answer, I will delete mine.
Here is a base R solution.
# renames the NY.GDP.PCAP.CD column to the country code and drops all but two columns
trans_one <- function(dat) {
  newcol <- dat[1, "iso2c"]
  idx <- which(colnames(dat) == "NY.GDP.PCAP.CD")
  colnames(dat)[[idx]] <- newcol
  dat <- dat[, c(newcol, "year")]
  dat
}
# split by country
sp <- split(wdi, wdi$iso2c)
# merge the country pieces together, one at a time, by year
fun <- function(x, y) {
  merge(x, trans_one(y), by = "year", all = TRUE)
}
Reduce(fun, x = tail(sp, -1), init = trans_one(sp[[1]]))
However, the reshape2 approach looks more straightforward to me now.