R: combine rows and columns within a dataframe

I've looked around for a while trying to figure this out, but I just can't seem to describe my problem concisely enough to google my way out of it. I am trying to work with Michigan COVID stats where the data has Detroit listed separately from Wayne County. I need to add Detroit's numbers to Wayne County's numbers, then remove the Detroit rows from the data frame.
I have included a screen grab too. For the purposes of this problem, can someone explain how I can get Detroit City's numbers added to Wayne's, and then make the Detroit City rows disappear? Thanks.
library(tidyverse)
library(openxlsx)
cases_deaths <- read.xlsx("https://www.michigan.gov/coronavirus/-/media/Project/Websites/coronavirus/Cases-and-Deaths/4-20-2022/Cases-and-Deaths-by-County-2022-04-20.xlsx?rev=f9f34cd7a4614efea0b7c9c00a00edfd&hash=AA277EC28A17C654C0EE768CAB41F6B5.xlsx")[,-5]
# Remove rows that don't describe counties
cases_deaths <- cases_deaths[-c(51,52,101,102,147,148,167,168),]
[screenshot: code chunk output]

You could do:
cases_deaths %>%
  filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
  mutate(COUNTY = "Wayne") %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarize_all(sum) %>%
  bind_rows(cases_deaths %>%
              filter(!COUNTY %in% c("Wayne", "Detroit City")))
#> # A tibble: 166 x 4
#> # Groups: COUNTY [83]
#> COUNTY CASE_STATUS Cases Deaths
#> <chr> <chr> <dbl> <dbl>
#> 1 Wayne Confirmed 377396 7346
#> 2 Wayne Probable 25970 576
#> 3 Alcona Confirmed 1336 64
#> 4 Alcona Probable 395 7
#> 5 Alger Confirmed 1058 8
#> 6 Alger Probable 658 5
#> 7 Allegan Confirmed 24109 294
#> 8 Allegan Probable 3024 52
#> 9 Alpena Confirmed 4427 126
#> 10 Alpena Probable 1272 12
#> # ... with 156 more rows
Created on 2022-04-23 by the reprex package (v2.0.1)
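An alternative sketch of the same idea: recode "Detroit City" to "Wayne" first, then aggregate everything in one pass, so no separate bind_rows step is needed. Shown here on a hypothetical toy frame standing in for cases_deaths, since the real workbook has to be downloaded:

```r
library(dplyr)

# Hypothetical toy stand-in for cases_deaths
toy <- data.frame(
  COUNTY      = c("Wayne", "Wayne", "Detroit City", "Detroit City", "Alcona"),
  CASE_STATUS = c("Confirmed", "Probable", "Confirmed", "Probable", "Confirmed"),
  Cases       = c(100, 10, 50, 5, 7),
  Deaths      = c(9, 1, 4, 2, 0)
)

merged <- toy %>%
  # fold Detroit City into Wayne before aggregating
  mutate(COUNTY = if_else(COUNTY == "Detroit City", "Wayne", COUNTY)) %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarise(across(c(Cases, Deaths), sum), .groups = "drop")
```

Because every non-Wayne county forms its own one-row group, its numbers pass through unchanged, and the Detroit City rows disappear into Wayne's totals.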

Related

dplyr arrange is not working while order is fine

I am trying to obtain the largest 10 investors in a country, but I get confusing results using arrange in dplyr versus order in base R.
head(fdi_partner)
gives the following result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
  arrange("Number of projects") %>%
  head()
gives almost the same result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code is working fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n = 11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's wrong with my code?
dplyr (and tidyverse packages more generally) expects unquoted variable names. Passing a quoted string to arrange() sorts by that constant string, so the row order is left unchanged. If your variable name has a space in it, you must wrap it in backticks:
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
  arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
  arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
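Since the original goal was the largest 10 investors, note that arrange() sorts ascending by default; a small sketch combining the backticks with desc() (on toy data, not the real fdi_partner):

```r
library(dplyr)

# Toy stand-in for the investor counts, with a spaced column name
toy <- data.frame(`Number of projects` = c(231, 1818, 24, 377),
                  check.names = FALSE)

# desc() sorts descending; backticks are still required for the spaced name
top <- toy %>%
  arrange(desc(`Number of projects`)) %>%
  head(10)
```

On the real data you would likely also drop the TOTAL row first, since it is not a country.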

dplyr: subset rows based on count of a column

I want to subset rows of a dataframe based on the number of observations in a given column such that I only get rows where there are n+ observations. I want to do this using Tidyverse functions, not base R functions.
For example: from the planes data from the nycflights13 package, keep all the rows where planes$manufacturer has at least 10 observations. The subset data should only have the following manufacturers:
[1] "AIRBUS" "AIRBUS INDUSTRIE" "BOEING"
[4] "BOMBARDIER INC" "EMBRAER" "MCDONNELL DOUGLAS"
[7] "MCDONNELL DOUGLAS AIRCRAFT CO" "MCDONNELL DOUGLAS CORPORATION"
Note: This post does not address this question.
You can do this:
library(dplyr)
df <- planes %>%
  dplyr::group_by(manufacturer) %>%
  dplyr::filter(n() >= 10)
Summary of Output
df %>% count(manufacturer)
# A tibble: 8 × 2
# Groups: manufacturer [8]
manufacturer n
<chr> <int>
1 AIRBUS 336
2 AIRBUS INDUSTRIE 400
3 BOEING 1630
4 BOMBARDIER INC 368
5 EMBRAER 299
6 MCDONNELL DOUGLAS 120
7 MCDONNELL DOUGLAS AIRCRAFT CO 103
8 MCDONNELL DOUGLAS CORPORATION 14
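The same filter can also be written with add_count(), which attaches the per-group count as a column and avoids leaving the result grouped; a sketch on toy data:

```r
library(dplyr)

# Toy stand-in for planes: one manufacturer with 12 rows, one with 3
toy <- data.frame(manufacturer = c(rep("BOEING", 12), rep("CESSNA", 3)))

kept <- toy %>%
  add_count(manufacturer) %>%  # adds a column n with the per-group count
  filter(n >= 10) %>%
  select(-n)                   # drop the helper column afterwards
```

This keeps exactly the rows whose manufacturer occurs at least 10 times.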

World Bank API query

I want to get data using the World Bank's API. For this purpose I use the following query.
library(httr)
library(jsonlite)
library(magrittr)

wb_data <- httr::GET("http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO?format=json") %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  data.frame()
It works pretty well. However, when I try to specify more than one indicator it doesn't work:
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?format=json
Note that if I change the format to XML and also add source=2 (because the data come from the same database, World Development Indicators), the query works:
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?source=2&format=xml
However, if I want to get data from different databases (e.g. WDI and Doing Business), it doesn't work again.
So, my first question is: how can I get multiple indicators from different databases using one query? According to the World Bank API tutorial I can include about 60 indicators.
My second question is: how can I specify the number of rows per page? As far as I know, I can add something like &per_page=100 to get 100 rows as output. Should I calculate the number of rows myself, or can I use something like &per_page=9999999 to get all the data in one request?
P.S. I don't want to use any wrapper libraries (such as wb or wbstats). I want to do it myself and also learn something new.
Here's an answer to your question. To use multiple indicators and return JSON, you need to provide both the source ID and the format type, as mentioned in the World Bank API tutorial. You can get the total number of pages from one of the returned JSON parameters, called "total". You can then use this value in a second GET request to return the full number of pages using the per_page parameter.
library(magrittr)
library(httr)
library(jsonlite)
# set up the target url - you need BOTH the source ID and the format parameters
target_url <- "http://api.worldbank.org/v2/country/chn;ago/indicator/AG.AGR.TRAC.NO;SP.POP.TOTL?source=2&format=json"
# look at the metadata returned for the target url
httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # the metadata is in the first item in the returned list of JSON
  extract2(1)
#> $page
#> [1] 1
#>
#> $pages
#> [1] 5
#>
#> $per_page
#> [1] 50
#>
#> $total
#> [1] 240
#>
#> $sourceid
#> NULL
#>
#> $lastupdated
#> [1] "2019-12-20"
# get the total number of pages for the target url query
wb_data_totalpagenumber <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the first item in the returned list of JSON
  extract2(1) %>%
  # get the total number of records, a named element called "total"
  extract2("total")
# get the data
wb_data <- httr::GET(paste0(target_url, "&per_page=", wb_data_totalpagenumber)) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the data, which is the second item in the returned list of JSON
  extract2(2) %>%
  data.frame()
# look at the data
dim(wb_data)
#> [1] 240 11
head(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 1 AGO 2019 NA 0 AG.AGR.TRAC.NO
#> 2 AGO 2018 NA 0 AG.AGR.TRAC.NO
#> 3 AGO 2017 NA 0 AG.AGR.TRAC.NO
#> 4 AGO 2016 NA 0 AG.AGR.TRAC.NO
#> 5 AGO 2015 NA 0 AG.AGR.TRAC.NO
#> 6 AGO 2014 NA 0 AG.AGR.TRAC.NO
#> indicator.value country.id country.value
#> 1 Agricultural machinery, tractors AO Angola
#> 2 Agricultural machinery, tractors AO Angola
#> 3 Agricultural machinery, tractors AO Angola
#> 4 Agricultural machinery, tractors AO Angola
#> 5 Agricultural machinery, tractors AO Angola
#> 6 Agricultural machinery, tractors AO Angola
tail(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 235 CHN 1965 715185000 <NA> 0 SP.POP.TOTL
#> 236 CHN 1964 698355000 <NA> 0 SP.POP.TOTL
#> 237 CHN 1963 682335000 <NA> 0 SP.POP.TOTL
#> 238 CHN 1962 665770000 <NA> 0 SP.POP.TOTL
#> 239 CHN 1961 660330000 <NA> 0 SP.POP.TOTL
#> 240 CHN 1960 667070000 <NA> 0 SP.POP.TOTL
#> indicator.value country.id country.value
#> 235 Population, total CN China
#> 236 Population, total CN China
#> 237 Population, total CN China
#> 238 Population, total CN China
#> 239 Population, total CN China
#> 240 Population, total CN China
Created on 2020-01-30 by the reprex package (v0.3.0)
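If one giant per_page request were ever refused, the same metadata could drive a page-by-page loop instead. A sketch of just the URL construction (no network call here, so the page count is an assumed stand-in for the "pages" value in the metadata):

```r
# Hypothetical helper: build one URL per page from the "pages" value
# reported in the metadata element of the JSON response.
build_page_urls <- function(base_url, pages) {
  paste0(base_url, "&page=", seq_len(pages))
}

urls <- build_page_urls(
  "http://api.worldbank.org/v2/country/chn;ago/indicator/SP.POP.TOTL?format=json",
  5  # assumed page count; in practice take this from the metadata
)
```

Each URL could then be fetched in a loop and the second JSON elements row-bound together.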

Collapsing Levels of a Factor Variable in one column while summing the counts in another

I originally had very wide data (4 rows with 158 columns), which I used reshape::melt() on to create a long data set (624 rows x 3 columns).
Now, however, I have a data set like this:
demo <- data.frame(region = as.factor(c("North", "South", "East", "West")),
                   criteria = as.factor(c("Writing_1_a", "Writing_2_a", "Writing_3_a", "Writing_4_a",
                                          "Writing_1_b", "Writing_2_b", "Writing_3_b", "Writing_4_b")),
                   counts = as.integer(c(18, 27, 99, 42, 36, 144, 99, 9)))
Which produces a table similar to the one below:
region criteria counts
North Writing_1_a 18
South Writing_2_a 27
East Writing_3_a 99
West Writing_4_a 42
North Writing_1_b 36
South Writing_2_b 144
East Writing_3_b 99
West Writing_4_b 9
Now what I want to create is something like this:
goal <- data.frame(region = as.factor(c("North", "South", "East", "West")),
                   criteria = as.factor(c("Writing_1", "Writing_2", "Writing_3", "Writing_4")),
                   counts = as.integer(c(54, 171, 198, 51)))
Meaning that when I collapse the criteria columns it sums the counts:
region criteria counts
North Writing_1 54
South Writing_2 171
East Writing_3 198
West Writing_4 51
I have tried using forcats::fct_collapse() and forcats::fct_recode(), but to no avail - I'm positive I'm just not doing it right. Thank you in advance for any assistance you can provide.
You can think about what exactly you're trying to do to change factor levels—fct_collapse would manually collapse several levels into one level, and fct_recode would manually change the labels of individual levels. What you're trying to do is change all the labels based on applying some function, in which case fct_relabel is appropriate.
You can write out an anonymous function when you call fct_relabel, or just pass it the name of a function and that function's argument(s). In this case, you can use stringr::str_remove to find and remove a regex pattern, and regex such as _[a-z]$ to remove any underscore and then lowercase letter that appear at the end of a string. That way it should scale well with your real data, but you can adjust it if not.
library(tidyverse)
...
new_crits <- demo %>%
  mutate(crit_no_digits = fct_relabel(criteria, str_remove, "_[a-z]$"))
new_crits
#> region criteria counts crit_no_digits
#> 1 North Writing_1_a 18 Writing_1
#> 2 South Writing_2_a 27 Writing_2
#> 3 East Writing_3_a 99 Writing_3
#> 4 West Writing_4_a 42 Writing_4
#> 5 North Writing_1_b 36 Writing_1
#> 6 South Writing_2_b 144 Writing_2
#> 7 East Writing_3_b 99 Writing_3
#> 8 West Writing_4_b 9 Writing_4
Verifying that this new variable has only the levels you want:
levels(new_crits$crit_no_digits)
#> [1] "Writing_1" "Writing_2" "Writing_3" "Writing_4"
And then summarizing based on that new factor:
new_crits %>%
  group_by(crit_no_digits) %>%
  summarise(counts = sum(counts))
#> # A tibble: 4 x 2
#> crit_no_digits counts
#> <fct> <int>
#> 1 Writing_1 54
#> 2 Writing_2 171
#> 3 Writing_3 198
#> 4 Writing_4 51
Created on 2018-11-04 by the reprex package (v0.2.1)
A dplyr solution using regular expressions:
demo %>%
  mutate(criteria = gsub("(_a)|(_b)", "", criteria)) %>%
  group_by(region, criteria) %>%
  summarize(counts = sum(counts)) %>%
  arrange(criteria) %>%
  as.data.frame
region criteria counts
1 North Writing_1 54
2 South Writing_2 171
3 East Writing_3 198
4 West Writing_4 51
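For completeness, fct_collapse (which the question originally tried) can also be driven programmatically, by building the named level mapping with split() instead of typing it out; a sketch on a small factor:

```r
library(forcats)

crit <- factor(c("Writing_1_a", "Writing_1_b", "Writing_2_a"))

# Group the existing levels by their suffix-stripped name:
# list(Writing_1 = c("Writing_1_a", "Writing_1_b"), Writing_2 = "Writing_2_a")
mapping <- split(levels(crit), sub("_[a-z]$", "", levels(crit)))

# Pass that named list to fct_collapse via do.call
collapsed <- do.call(fct_collapse, c(list(crit), mapping))
```

After collapsing, the same group_by/summarise step as above would sum the counts.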

use model object, e.g. panelmodel, to flag data used

Is it possible in some way to use a fit object, specifically the regression object I get from a plm() model, to flag the observations in the data that were actually used in the regression? I realize this could be done by looking for complete observations in my original data, but I am curious if there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example.
First, some packages needed:
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second, some data; this example draws heavily on Baltagi (2013), table 3.1, found in ?plm:
data("Grunfeld", package = "plm")
dta <- Grunfeld
Now I create some semi-random missing values in my data object, dta:
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
The final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse:
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
the results, using stargazer,
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that were used in the regression.
I am wondering if there's some vector in plm.reg I can (easily) use to create a flag in my original data, dta, marking whether each observation was used or not, i.e. excluding the semi-random missing values I created above. Maybe some broom-like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this; the new element is the vector plm.reg at the end, i.e.:
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update: I tried to use broom's augment(), but unfortunately it gave me the following error message instead of the flag I had hoped for,
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
  as.data.frame %>%
  rowid_to_column %>%
  mutate(plm.reg = rowid %in% names(plm.reg$residuals))
For people who use pdata.frame() to create an index attribute that describes the individual and time dimensions, you can use the following code; this is from another Baltagi example in ?plm:
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>%
  mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
  as_tibble() %>%
  select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
Finally, if you are running the first Baltagi example without index attributes, i.e. the unmodified example from the help file, the code should be:
Grunfeld %>%
  rowid_to_column %>%
  mutate(plm.reg = rowid %in% names(p$residuals)) %>%
  as_tibble()
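As the question itself notes, the same flag can also be obtained without the fit object by checking complete cases on the model variables; a minimal sketch on a toy frame:

```r
# Toy stand-in for dta; inv/value/capital mirror the regression variables
dta <- data.frame(
  inv     = c(318, NA, 392),
  value   = c(3078, 5387, NA),
  capital = c(2.8, 157, 52.6)
)

# TRUE exactly for the rows plm() would keep in a listwise-deletion fit
dta$used <- complete.cases(dta[, c("inv", "value", "capital")])
```

This is a useful cross-check on the residual-names approach, since both should flag the same rows.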
