R - Purrr - Apply Function to Tibbles in List

I need some help applying a function to four tibbles individually stored in the same list.
Function:
status_fun <- function(Status,
                       Escalated,
                       Created,
                       Resolved) {
  if (Escalated == "Yes") {
    return("Escalated")
  } else if (Status == "Closed" &&
             (month(Created) == month(Resolved) || Resolved - Created < 5)) {
    return("Closed")
  } else {
    return("Not Solved")
  }
}
I have a list containing four tibbles of different sizes.
I simply want to apply the function above, which uses four columns, to each tibble, but I'm getting all sorts of errors. I've searched as much as I can and read R4DS and other posts here, but I can't find a solution.
dummy %>%
  map(., status_fun)
Error in .f(.x[[i]], ...) :
  argument "Escalated" is missing with no default

dummy %>%
  map(~ map(., status_fun))
Error in .f(.x[[i]], ...) :
  argument "Escalated" is missing with no default
The following returns a list with only one value per element, which is not what I want; I want a list of four tibbles with the same number of rows as the input:
dummy %>%
  map(., ~ status_fun(Status = 'Status', Escalated = 'Escalated', Created = 'Created', Resolved = 'Resolved'))
[[1]]
[1] "Not Solved"
[[2]]
[1] "Not Solved"
[[3]]
[1] "Not Solved"
[[4]]
[1] "Not Solved"
The dummy list is the following:
[[1]]
# A tibble: 589 x 5
Created Resolved Status Country Escalated
<date> <date> <chr> <chr> <chr>
1 2020-04-03 2020-04-08 Closed Luxembourg No
2 2020-03-31 NA In Progress France No
3 2020-03-31 NA In Progress France No
4 2020-03-31 NA In Progress Luxembourg No
5 2020-03-31 NA In Progress Luxembourg No
6 2020-03-30 NA In Progress France Yes
7 2020-03-27 NA In Progress Ireland No
8 2020-03-27 2020-04-10 Closed Luxembourg No
9 2020-03-27 NA In Progress Luxembourg No
10 2020-03-27 2020-03-30 Closed Ireland No
# ... with 579 more rows
[[2]]
# A tibble: 316 x 5
Created Resolved Status Country Escalated
<date> <date> <chr> <chr> <chr>
1 2020-04-13 NA Open Luxembourg No
2 2020-04-13 NA Open Spain No
3 2020-04-07 NA Open France No
4 2020-04-03 NA In Progress Luxembourg No
5 2020-03-30 NA Awaiting Information Luxembourg No
6 2020-03-30 NA Awaiting Information France Yes
7 2020-03-30 2020-03-31 Closed France No
8 2020-03-30 NA Awaiting Information France No
9 2020-03-30 NA Awaiting Information Spain No
10 2020-03-30 NA Awaiting Information Sweden No
# ... with 306 more rows
[[3]]
# A tibble: 64 x 5
Created Resolved Status Country Escalated
<date> <date> <chr> <chr> <chr>
1 2020-04-13 NA Open Chile No
2 2020-04-10 NA Open Mexico Yes
3 2020-04-10 NA Awaiting Information Mexico No
4 2020-04-09 NA Open Chile No
5 2020-04-03 2020-04-06 Closed Mexico Yes
6 2020-04-02 2020-04-02 Closed Mexico No
7 2020-04-01 2020-04-01 Closed Mexico No
8 2020-03-31 2020-04-01 Closed Brazil No
9 2020-03-30 2020-03-31 Closed Mexico No
10 2020-03-27 2020-04-06 Closed Mexico No
# ... with 54 more rows
[[4]]
# A tibble: 30 x 5
Created Resolved Status Country Escalated
<date> <date> <chr> <chr> <chr>
1 2020-04-13 NA Open Chile No
2 2020-04-07 NA Open Brazil No
3 2020-03-23 2020-03-25 Closed Chile No
4 2020-03-17 2020-03-18 Closed Chile No
5 2020-03-16 NA Open Mexico No
6 2020-03-11 2020-03-11 Closed Brazil No
7 2020-03-11 2020-03-12 Closed Brazil No
8 2020-03-10 2020-03-10 Closed Brazil No
9 2020-03-09 NA In Progress Brazil No
10 2020-03-02 2020-03-03 Closed Brazil No
# ... with 20 more rows
What am I missing?
I've tried all sorts of pmap and map2 approaches, plus the instructions here (Code not working using map from purrr package in R)
and here (Apply function to nested loop (purrr package?)),
with no success.
Thanks in advance to anyone willing to take the time to help.
> version _
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day
packageVersion("tidyverse")
[1] ‘1.3.0’
packageVersion("lubridate")
[1] ‘1.7.8’

One issue is that you are passing a single data.frame to a function that expects 4 arguments. To fix that you could change your function to:
new_fx = function(DF) {
  Status = DF$Status
  Escalated = DF$Escalated
  ...
}
map(dummy, new_fx)
The next potential issue is your use of if ... else .... Because this was not a reproducible example with expected output, I am assuming you want to add a column based on the if ... else ... logic. You will want to get rid of the double && and ||, because they evaluate to a single logical value.
Along with that, switch to ifelse() or, since you are in the tidyverse, case_when(); either will produce a vector of the expected length.
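As a quick illustration of the difference (toy vectors, not the original data):

# && and || only ever yield a single TRUE/FALSE, while & and | are
# vectorised and return one value per row
status    <- c("Closed", "In Progress", "Closed")
escalated <- c("No", "Yes", "No")

status == "Closed" & escalated == "No"
#> [1]  TRUE FALSE  TRUE

ifelse(status == "Closed" & escalated == "No", "Closed", "Not Solved")
#> [1] "Closed"     "Not Solved" "Closed"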

For anyone struggling with mutating columns on several tibbles inside a list object, the code below worked for the problem above:
status_fun <- function(df) {
  # these assignments are not strictly needed (mutate() can see the columns
  # of df directly), but they do no harm
  Escalated = df$Escalated
  Status = df$Status
  Created = df$Created
  Resolved = df$Resolved
  dplyr::mutate(df,
                Status = case_when(
                  Escalated == "Yes" ~ "Escalated",
                  (Status == "Closed" &
                     (month(Created) == month(Resolved) | Resolved - Created < 5)) ~ "Closed",
                  TRUE ~ "Not Solved"
                ))
}
dummy <- dummy %>% map(., status_fun)
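For reference, the same transformation can also be written inline with a purrr lambda instead of a named helper; this sketch should behave identically to status_fun() above:

library(purrr)
library(dplyr)
library(lubridate)

# apply the same case_when() mutate to every tibble in the list
dummy <- map(dummy, ~ mutate(.x,
  Status = case_when(
    Escalated == "Yes" ~ "Escalated",
    Status == "Closed" &
      (month(Created) == month(Resolved) | Resolved - Created < 5) ~ "Closed",
    TRUE ~ "Not Solved"
  )
))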

Related

Divide Results Into Three Groups Based On Condition And Date Check

This is one I have been having trouble with for days. I need to take my data and divide results into three groups based on conditions and date check. You can see this in the original data table that I have provided.
(table with the original data, provided as an image)
Basically, I need to do this by individual. If they fail then they have 7 days to pass. If they fail and pass within 7 days then they go in the Yes category. If they fail and then have another failure within 7 days, they go in the No category. If they have a failed result and nothing after that, then they go in the Refused category.
So, I need to test the row after a Fail for a Pass or Fail or Nothing by individual and then check that it is within 7 days.
Individuals such as Sam, who did not take another test after the second failure, can be in multiple groups at the same time. Luke, on the other hand, passed, but it was after the 7-day period, so he scored a Refused. The final table would then look like this:
(final table, provided as an image)
I have tried to use if-else statements, but I don't know how to check the next row for the same individual and ignore all rows other than the one, if any, immediately after a Fail for each individual.
I don't know if this can be done in R but I appreciate any help I can get on this.
Thank you!
This is not a complete solution, but here is my suggestion.
Your dataset:
# A tibble: 13 x 4
name result time_1 time_2
<chr> <chr> <date> <date>
1 Joe Fail 2022-03-01 NA
2 Joe Pass NA 2022-03-05
3 Heather Fail 2022-03-21 NA
4 Heather Pass NA 2022-03-26
5 Heather Pass NA 2022-03-27
6 Heather Fail 2022-03-13 NA
7 Heather Pass NA 2022-03-17
8 Sam Fail 2022-03-20 NA
9 Sam Fail 2022-03-21 NA
10 Luke Fail 2022-03-11 NA
11 Luke Pass NA 2022-03-13
12 Luke Fail 2022-03-19 NA
13 Luke Pass NA 2022-03-29
library(lubridate)
library(tidyverse)
df_clean <- df %>%
  arrange(name, result, time_1, time_2) %>%
  group_by(name, result) %>%
  mutate(attempt = 1:n()) %>%
  unite(col = "result",
        c("result", "attempt"),
        sep = "_", remove = TRUE) %>%
  unite(col = "time",
        c("time_1", "time_2"),
        sep = "", remove = TRUE) %>%
  mutate(time = time %>% str_remove_all("NA") %>% as.Date()) %>%
  ungroup() %>%
  spread(key = result, value = time)
"Cleaned dataset":
# A tibble: 4 x 6
name Fail_1 Fail_2 Pass_1 Pass_2 Pass_3
<chr> <date> <date> <date> <date> <date>
1 Heather 2022-03-13 2022-03-21 2022-03-17 2022-03-26 2022-03-27
2 Joe 2022-03-01 NA 2022-03-05 NA NA
3 Luke 2022-03-11 2022-03-19 2022-03-13 2022-03-29 NA
4 Sam 2022-03-20 2022-03-21 NA NA NA
df_clean %>%
  mutate(yes = case_when(interval(Fail_1, Pass_1) %>%
                           as.numeric("days") <= 7 ~ 1,
                         TRUE ~ 0),
         refused = case_when(is.Date(Fail_1) & is.na(Pass_1) ~ 1,
                             TRUE ~ 0))
# A tibble: 4 x 8
name Fail_1 Fail_2 Pass_1 Pass_2 Pass_3 yes refused
<chr> <date> <date> <date> <date> <date> <dbl> <dbl>
1 Heather 2022-03-13 2022-03-21 2022-03-17 2022-03-26 2022-03-27 1 0
2 Joe 2022-03-01 NA 2022-03-05 NA NA 1 0
3 Luke 2022-03-11 2022-03-19 2022-03-13 2022-03-29 NA 1 0
4 Sam 2022-03-20 2022-03-21 NA NA NA 0 1
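Following the same pattern, the missing "No" group (a second failure within 7 days of the first) could be sketched like this; the column names are the ones from df_clean above, and this is an untested extension of the answer rather than part of it:

df_clean %>%
  mutate(yes = case_when(interval(Fail_1, Pass_1) %>%
                           as.numeric("days") <= 7 ~ 1,
                         TRUE ~ 0),
         # second failure within 7 days of the first -> "No" group
         no = case_when(interval(Fail_1, Fail_2) %>%
                          as.numeric("days") <= 7 ~ 1,
                        TRUE ~ 0),
         refused = case_when(is.Date(Fail_1) & is.na(Pass_1) ~ 1,
                             TRUE ~ 0))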

Creating a dummy/indicator variable that references date variables from two different data frames

My problem is quite basic, but I'm new to R and I've been trying to solve this issue for a couple of days now without any success :(
What I'm working with
This is the CoronaNet_cleaned data frame.
country date_start date_end
South Africa 2020-03-22 NA
South Africa 2020-04-12 2020-06-02
Australia 2021-02-11 2020-04-12
Australia 2020-06-10 NA
United States 2020-01-01 NA
United States 2020-12-08 NA
This is the tweetgovuser data frame
country screen_name created_at text
South Africa HealthZA 2020-12-08 The number of health care workers....
...
What I want
I want to create a column in tweetgovuser called lockdown_dummy. I want this indicator/dummy variable to be created based off of three conditions:
if created_at (tweetgovuser) matches date_start or date_end (CoronaNet_cleaned), let lockdown_dummy (in tweetgovuser) = 1
if created_at is between the dates of date_end and date_start, let lockdown_dummy = 1
If none of the above conditions are true, let lockdown_dummy = 0
The end product should look like this:
country screen_name created_at text lockdown_dummy
South Africa HealthZA 2020-12-08 The number.... 1
...
What I've tried
I've tried several different blocks of code but recently I've written this very crude and clearly poorly written code to execute this:
lockdown_dummy <- case_when(
created_at == date_start ~ 1,
created_at == date_end ~ 1,
"date_start" %<% created_at %<% "date_end" ~ 1
TRUE ~ 0
)
Nice question. Next time, try to share more data along with some importing code so that it is easier for us to experiment; you can use functions like dput() for this.
In your case, the first thing to do is to join the two tables. Here I used left_join().
Then you only need a call to ifelse() to check whether the date is on or after the start and on or before the end.
Here is the code:
library(tidyverse)
library(lubridate)
CoronaNet_cleaned = read.table(header=T, text="
country date_start date_end
'South Africa' 2020-03-22 NA
'South Africa' 2020-03-22 2027-03-22
'South Africa' 2020-04-12 2020-06-02
Australia 2021-02-11 2020-04-12
Australia 2020-06-10 NA
'United States' 2020-01-01 NA
'United States' 2020-12-08 NA")
tweetgovuser = read.table(header=T, text="
country screen_name created_at text
'South Africa' HealthZA 2020-12-08 'The number of health care workers'
")
CoronaNet_cleaned %>%
  left_join(tweetgovuser, by = "country") %>%
  mutate(
    date_start = ymd(date_start),   # probably not needed with real data
    date_end = ymd(date_end),       # probably not needed with real data
    created_at = ymd(created_at),   # probably not needed with real data
    dummy = ifelse(created_at >= date_start & created_at <= date_end, 1, 0),
  )
#> # A tibble: 7 x 7
#> country date_start date_end screen_name created_at text dummy
#> <chr> <date> <date> <chr> <date> <chr> <dbl>
#> 1 South Afr~ 2020-03-22 NA HealthZA 2020-12-08 The number of h~ NA
#> 2 South Afr~ 2020-03-22 2027-03-22 HealthZA 2020-12-08 The number of h~ 1
#> 3 South Afr~ 2020-04-12 2020-06-02 HealthZA 2020-12-08 The number of h~ 0
#> 4 Australia 2021-02-11 2020-04-12 <NA> NA <NA> NA
#> 5 Australia 2020-06-10 NA <NA> NA <NA> NA
#> 6 United St~ 2020-01-01 NA <NA> NA <NA> NA
#> 7 United St~ 2020-12-08 NA <NA> NA <NA> NA
Created on 2021-05-20 by the reprex package (v1.0.0)
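One possible refinement (my assumption, not part of the answer above): if a missing date_end means the measure is still in force, and the question wants 0 whenever the conditions are not met, the NA cases can be handled explicitly:

CoronaNet_cleaned %>%
  left_join(tweetgovuser, by = "country") %>%
  mutate(
    date_start = ymd(date_start),
    date_end   = ymd(date_end),
    created_at = ymd(created_at),
    # treat a missing date_end as an interval that has not ended yet
    dummy = ifelse(created_at >= date_start &
                     (is.na(date_end) | created_at <= date_end), 1, 0),
    # rows with no matching tweet (created_at is NA) get 0 instead of NA
    dummy = tidyr::replace_na(dummy, 0)
  )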

Convert tibble to time series

I tried to download data on covid provided by the Economist's Github repository.
library(readr)
library(knitr)
myfile <- "https://raw.githubusercontent.com/TheEconomist/covid-19-excess-deaths-tracker/master/output-data/excess-deaths/all_weekly_excess_deaths.csv"
test <- read_csv(myfile)
What I get is a tibble data frame and I am unable to easily access the data stored in that tibble. I would like to look at one column, say test$covid_deaths_per_100k and re-shape that into a matrix or ts object with rows referring to time and columns referring to countries.
I tried it manually, but I failed. Then I tried with the tsibble package and failed again:
tsibble(test[c("covid_deaths_per_100k","country")],index=test$start_date)
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `date`.
ℹ It must be numeric or character.
So, I guess the problem is that the data are stacked by countries and hence the time index is duplicated. I would need some of these magic pipe functions to make this work? Is there an easy way to do that, perhaps without piping?
A valid tsibble must have distinct rows identified by key and index:
as_tsibble(test, index = start_date, key = c(country, region))
# A tsibble: 11,715 x 17 [1D]
# Key: country, region [176]
country region region_code start_date end_date days year week population total_deaths
<chr> <chr> <chr> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Australia Australia 0 2020-01-01 2020-01-07 7 2020 1 25734100 2497
2 Australia Australia 0 2020-01-08 2020-01-14 7 2020 2 25734100 2510
3 Australia Australia 0 2020-01-15 2020-01-21 7 2020 3 25734100 2501
4 Australia Australia 0 2020-01-22 2020-01-28 7 2020 4 25734100 2597
5 Australia Australia 0 2020-01-29 2020-02-04 7 2020 5 25734100 2510
6 Australia Australia 0 2020-02-05 2020-02-11 7 2020 6 25734100 2530
7 Australia Australia 0 2020-02-12 2020-02-18 7 2020 7 25734100 2613
8 Australia Australia 0 2020-02-19 2020-02-25 7 2020 8 25734100 2608
9 Australia Australia 0 2020-02-26 2020-03-03 7 2020 9 25734100 2678
10 Australia Australia 0 2020-03-04 2020-03-10 7 2020 10 25734100 2602
# ... with 11,705 more rows, and 7 more variables: covid_deaths <dbl>, expected_deaths <dbl>,
# excess_deaths <dbl>, non_covid_deaths <dbl>, covid_deaths_per_100k <dbl>,
# excess_deaths_per_100k <dbl>, excess_deaths_pct_change <dbl>
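If the goal is literally the wide layout asked about (rows = time, columns = one series per area), a pivot_wider() sketch along these lines should get close; because the data are keyed by country plus region, the column names below combine both:

library(dplyr)
library(tidyr)

# one row per start_date, one column per country_region combination
test %>%
  select(start_date, country, region, covid_deaths_per_100k) %>%
  pivot_wider(names_from = c(country, region),
              values_from = covid_deaths_per_100k)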
ts works best with monthly, quarterly or annual series. Here we show a few approaches.
1) Monthly. This creates a monthly zoo object z from the indicated test columns, splitting by country and aggregating to produce a monthly time series, and then converts it to a ts object.
library(zoo)
z <- read.zoo(test[c("start_date", "country", "covid_deaths")],
              split = "country", FUN = as.yearmon, aggregate = sum)
as.ts(z)
2) Weekly. To create a weekly ts object with frequency 53:
to_weekly <- function(x) {
  yr <- as.integer(as.yearmon(x))
  wk <- as.integer(format(as.Date(x), "%U"))
  yr + wk/53
}
z <- read.zoo(test[c("start_date", "country", "covid_deaths")],
              split = "country", FUN = to_weekly, aggregate = sum)
as.ts(z)
3) Daily. If you want a series whose times are dates, omit the FUN argument and use zoo directly.
z <- read.zoo(test[c("end_date", "country", "covid_deaths")],
              split = "country", aggregate = sum)

Restructuring data depending on recurring values in R

Right now, I am trying to restructure my data (it's about the responsiveness of contacted people in a survey), which has a structure like this:
df_test <- data.frame(
  Residence = c(rep("Berlin", 10), rep("Frankfurt", 10), rep("Munich", 10)),
  Response = c(rep(TRUE, 14), rep(FALSE, 16)),
  ID = c(rep(1:15, each = 2)),
  Contact = c(rep(c("Phone", "Mail", "In_Person", "Phone", "eMail", "Phone"))),
  Date = sample(seq(as.Date('2000/01/01'), as.Date('2001/01/01'), by = "day"), 30)
)
df_test <- df_test[order(df_test$ID, df_test$Date), ]
In the resulting dataframe, each line represents one contact event and, usually, all people (labelled by ID) have been contacted multiple times by different means:
#first 4 lines of dataframe:
Residence Response ID Contact Date
2 Berlin TRUE 1 Mail 2000-07-25
1 Berlin TRUE 1 Phone 2000-09-25
3 Berlin TRUE 2 In_Person 2000-02-06
4 Berlin TRUE 2 Phone 2000-10-01
To get a nice overview with focus on the contacted people for e.g. plots, I want to create a new data frame in which every line represents one contacted person, with fixed values just appearing once (e.g. ID, Residence, Response) while contact-specific values (Contact, Date) are listed in each line like so:
#restructured lines in new dataframe from first 4 lines of original dataframe:
Residence Response ID Contact Date Contact.1 Date.1
1 Berlin TRUE 1 Mail 2000-07-25 Phone 2000-09-25
2 Berlin TRUE 2 In_Person 2000-02-06 Phone 2000-10-01
With the initial sorting by date, I hope to also get the contact attempts in each line in chronological order.
While I don't have any code that is close to running, I tried to at least get a data frame with an empty column and fill it with the extracted IDs, without duplicates:
for (i in df_test[, "ID"]){
  if (df_test[i, "ID"] != df_test[i - 1, "ID"]){
    df_test_restructured <- append(df_test_restructured, df_test[i, "ID"])
  }
}
After many unfruitful attempts, I figured there should be some existing and more efficient strategies or functions unknown to me. Any suggestions? Thanks in advance <3
EDIT: Ideally, each row would have the contact attempts listed in order, since people have also been contacted multiple times with the same medium. I want to extract information like, e.g., whether people mostly responded after the first reminder email once they had already been sent an initial email.
Assuming you want one row per person (ID) showing on which date and by which means (phone, email, ...) there was a contact, you could do something like this with the tidyverse.
library(tidyverse)
df_test <- data.frame(
  Residence = c(rep("Berlin", 10), rep("Frankfurt", 10), rep("Munich", 10)),
  Response = c(rep(TRUE, 14), rep(FALSE, 16)),
  ID = c(rep(1:15, each = 2)),
  Contact = c(rep(c("Phone", "Mail", "In_Person", "Phone", "eMail", "Phone"))),
  Date = sample(seq(as.Date('2000/01/01'), as.Date('2001/01/01'), by = "day"), 30)
)
df_test %>%
  group_by(ID) %>%
  pivot_wider(names_from = Contact, values_from = Date)
#> # A tibble: 15 x 7
#> # Groups: ID [15]
#> Residence Response ID Phone Mail In_Person eMail
#> <chr> <lgl> <int> <date> <date> <date> <date>
#> 1 Berlin TRUE 1 2000-01-04 2000-09-06 NA NA
#> 2 Berlin TRUE 2 2000-03-15 NA 2000-05-19 NA
#> 3 Berlin TRUE 3 2000-11-05 NA NA 2000-05-06
#> 4 Berlin TRUE 4 2000-11-02 2000-03-29 NA NA
#> 5 Berlin TRUE 5 2000-12-20 NA 2000-04-30 NA
#> 6 Frankfurt TRUE 6 2000-02-23 NA NA 2000-02-05
#> 7 Frankfurt TRUE 7 2000-08-30 2000-11-29 NA NA
#> 8 Frankfurt FALSE 8 2000-02-20 NA 2000-08-08 NA
#> 9 Frankfurt FALSE 9 2000-12-11 NA NA 2000-05-25
#> 10 Frankfurt FALSE 10 2000-12-21 2000-01-15 NA NA
#> 11 Munich FALSE 11 2000-07-07 NA 2000-12-16 NA
#> 12 Munich FALSE 12 2000-08-26 NA NA 2000-09-03
#> 13 Munich FALSE 13 2000-05-02 2000-11-20 NA NA
#> 14 Munich FALSE 14 2000-04-05 NA 2000-09-30 NA
#> 15 Munich FALSE 15 2000-09-26 NA NA 2000-05-22
New Addition based on your new target
I am not sure if this is the tidiest way, but I guess it is what you are looking for.
df_test %>%
  group_by(ID) %>%
  arrange(Date) %>%
  mutate(no = row_number()) %>%
  pivot_wider(names_from = c(no), values_from = c(Contact, Date)) %>%
  select(c(Residence:Contact_1, Date_1, Contact_2, Date_2)) %>%
  arrange(ID)
#> # A tibble: 15 x 7
#> # Groups: ID [15]
#> Residence Response ID Contact_1 Date_1 Contact_2 Date_2
#> <chr> <lgl> <int> <chr> <date> <chr> <date>
#> 1 Berlin TRUE 1 Mail 2000-01-09 Phone 2000-04-26
#> 2 Berlin TRUE 2 Phone 2000-01-27 In_Person 2000-10-14
#> 3 Berlin TRUE 3 eMail 2000-03-01 Phone 2000-07-14
#> 4 Berlin TRUE 4 Phone 2000-05-19 Mail 2000-09-22
#> 5 Berlin TRUE 5 Phone 2000-07-06 In_Person 2000-12-03
#> 6 Frankfurt TRUE 6 eMail 2000-07-05 Phone 2000-11-20
#> 7 Frankfurt TRUE 7 Phone 2000-02-06 Mail 2000-12-28
#> 8 Frankfurt FALSE 8 Phone 2000-04-03 In_Person 2000-09-06
#> 9 Frankfurt FALSE 9 eMail 2000-06-16 Phone 2000-06-24
#> 10 Frankfurt FALSE 10 Phone 2000-01-26 Mail 2000-05-02
#> 11 Munich FALSE 11 In_Person 2000-02-15 Phone 2000-06-28
#> 12 Munich FALSE 12 eMail 2000-03-22 Phone 2000-04-24
#> 13 Munich FALSE 13 Phone 2000-03-21 Mail 2000-08-02
#> 14 Munich FALSE 14 In_Person 2000-09-01 Phone 2000-11-27
#> 15 Munich FALSE 15 Phone 2000-05-27 eMail 2000-07-09
You can start by doing:
> df_test %>%
+ pivot_wider(names_from = Contact,values_from=Date)
# A tibble: 15 x 7
Residence Response ID Phone Mail In_Person eMail
<fct> <lgl> <int> <date> <date> <date> <date>
1 Berlin TRUE 1 2000-01-20 2000-02-18 NA NA
2 Berlin TRUE 2 2000-07-24 NA 2000-03-19 NA
Actually, plotting with your original df is really doable.
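For example, a minimal ggplot sketch straight from the long df_test (the exact plot is only an illustration of what is possible without reshaping):

library(ggplot2)

# how many contact attempts per medium ended in a response vs. no response
ggplot(df_test, aes(x = Contact, fill = Response)) +
  geom_bar(position = "dodge") +
  labs(y = "Number of contact attempts")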

World Bank API query

I want to get data using the World Bank's API. For this purpose I use the following query.
library(httr)
library(jsonlite)
library(magrittr)

wb_data <- httr::GET("http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO?format=json") %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  data.frame()
It works pretty well. However, when I try to specify more than one indicator, it doesn't work.
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?format=json
Note that if I change the format to xml and also add source=2 (because the data come from the same database, the World Development Indicators), the query works.
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?source=2&formal=xml
However, if I want to get data from different databases (e.g. WDI and Doing Business), it again doesn't work.
So, my first question is: how can I get multiple indicators from different databases using one query? According to the World Bank API tutorial, I can include about 60 indicators.
My second question is how to specify the number of rows per page. As far as I know, I can add something like &per_page=100 to get 100 rows of output. Should I calculate the number of rows myself, or can I use something like &per_page=9999999 to get all the data in one request?
P.S. I don't want to use any wrapper packages (such as wb or wbstats). I want to do it myself and also learn something new.
Here's an answer to your question. To use multiple indicators and return JSON, you need to provide both the source ID and the format type, as mentioned in the World Bank API tutorial. You can get the total number of records from one of the returned JSON parameters, called "total". You can then use this value in a second GET request with the per_page parameter to return everything in a single page.
library(magrittr)
library(httr)
library(jsonlite)
# set up the target url - you need BOTH the source ID and the format parameters
target_url <- "http://api.worldbank.org/v2/country/chn;ago/indicator/AG.AGR.TRAC.NO;SP.POP.TOTL?source=2&format=json"
# look at the metadata returned for the target url
httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # the metadata is in the first item in the returned list of JSON
  extract2(1)
#> $page
#> [1] 1
#>
#> $pages
#> [1] 5
#>
#> $per_page
#> [1] 50
#>
#> $total
#> [1] 240
#>
#> $sourceid
#> NULL
#>
#> $lastupdated
#> [1] "2019-12-20"
# get the total number of records for the target url query
wb_data_totalpagenumber <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the first item in the returned list of JSON
  extract2(1) %>%
  # get the total number of records, which is a named element called "total"
  extract2("total")

# get the data
wb_data <- httr::GET(paste0(target_url, "&per_page=", wb_data_totalpagenumber)) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the data, which is the second item in the returned list of JSON
  extract2(2) %>%
  data.frame()
# look at the data
dim(wb_data)
#> [1] 240 11
head(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 1 AGO 2019 NA 0 AG.AGR.TRAC.NO
#> 2 AGO 2018 NA 0 AG.AGR.TRAC.NO
#> 3 AGO 2017 NA 0 AG.AGR.TRAC.NO
#> 4 AGO 2016 NA 0 AG.AGR.TRAC.NO
#> 5 AGO 2015 NA 0 AG.AGR.TRAC.NO
#> 6 AGO 2014 NA 0 AG.AGR.TRAC.NO
#> indicator.value country.id country.value
#> 1 Agricultural machinery, tractors AO Angola
#> 2 Agricultural machinery, tractors AO Angola
#> 3 Agricultural machinery, tractors AO Angola
#> 4 Agricultural machinery, tractors AO Angola
#> 5 Agricultural machinery, tractors AO Angola
#> 6 Agricultural machinery, tractors AO Angola
tail(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 235 CHN 1965 715185000 <NA> 0 SP.POP.TOTL
#> 236 CHN 1964 698355000 <NA> 0 SP.POP.TOTL
#> 237 CHN 1963 682335000 <NA> 0 SP.POP.TOTL
#> 238 CHN 1962 665770000 <NA> 0 SP.POP.TOTL
#> 239 CHN 1961 660330000 <NA> 0 SP.POP.TOTL
#> 240 CHN 1960 667070000 <NA> 0 SP.POP.TOTL
#> indicator.value country.id country.value
#> 235 Population, total CN China
#> 236 Population, total CN China
#> 237 Population, total CN China
#> 238 Population, total CN China
#> 239 Population, total CN China
#> 240 Population, total CN China
Created on 2020-01-30 by the reprex package (v0.3.0)
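An alternative to one large per_page request is to loop over the API's page parameter using the "pages" value from the metadata (the metadata above shows both page and pages, so paging is supported). A sketch, reusing the same target_url and only base R plus the packages already loaded:

# number of result pages, taken from the metadata element
n_pages <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  extract2(1) %>%
  extract2("pages")

# fetch each page in turn and bind the data frames together
pages_list <- lapply(seq_len(n_pages), function(p) {
  httr::GET(paste0(target_url, "&page=", p)) %>%
    content("text", encoding = "UTF-8") %>%
    fromJSON(flatten = T) %>%
    extract2(2)
})
wb_data_paged <- do.call(rbind, pages_list)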
