To add an indicator variable to a data frame - r

I want to add a new column called "Missing" to the penguins data frame: for any row with at least one NA, the Missing column should be TRUE, and FALSE otherwise. My code just added all FALSE to the new column. How do I fix this? Thank you.
## install.packages("palmerpenguins")
library(palmerpenguins)
library(dplyr)  # provides %>% and mutate()
View(penguins)
penguins_m <- penguins %>%
  mutate(Missing = ifelse(is.na(.),T,F))
species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex year Missing[,"speci… [,"island"] [,"bill_length_… [,"bill_depth_m…
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <lgl> <lgl> <lgl> <lgl>
1 Adelie Torge… 39.1 18.7 181 3750 male 2007 FALSE FALSE FALSE FALSE
2 Adelie Torge… 39.5 17.4 186 3800 fema… 2007 FALSE FALSE FALSE FALSE
3 Adelie Torge… 40.3 18 195 3250 fema… 2007 FALSE FALSE FALSE FALSE
4 Adelie Torge… NA NA NA NA NA 2007 FALSE FALSE TRUE TRUE
5 Adelie Torge… 36.7 19.3 193 3450 fema… 2007 FALSE FALSE FALSE FALSE
6 Adelie Torge… 39.3 20.6 190 3650 male 2007 FALSE FALSE FALSE FALSE

This is what the complete.cases function does, so you can do
penguins_m <- penguins %>%
  mutate(Missing = !complete.cases(.))
A couple comments on your attempt - when you have a function or test that returns TRUE or FALSE, you don't need to wrap it in ifelse to get a TRUE/FALSE result.
Your attempt doesn't work because is.na(x) doesn't return 1 value per row if x is a data frame - it actually returns a matrix of TRUE/FALSE values showing whether each individual value is missing or not. So, if we didn't know about complete.cases we could use it like this:
penguins_m <- penguins %>%
  mutate(Missing = rowSums(is.na(.)) > 0)
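To see the difference without loading penguins, here is a minimal base R sketch on a made-up three-row data frame (the column names x, y, z and the values are only for illustration). It shows that is.na() gives one value per cell, and that both approaches above collapse that to one value per row.

```r
# Toy data frame (hypothetical): row 2 has one NA, row 3 has two.
df <- data.frame(
  x = c(1, NA, 3),
  y = c("a", "b", NA),
  z = c(TRUE, FALSE, NA)
)

# is.na() on a data frame returns a logical matrix, one cell per value,
# which is why it cannot be used directly as a single new column.
na_matrix <- is.na(df)
dim(na_matrix)

# Two equivalent ways to reduce that matrix to one TRUE/FALSE per row.
missing_a <- !complete.cases(df)
missing_b <- rowSums(na_matrix) > 0

# No ifelse() needed: the logical vector is already the indicator.
df$Missing <- missing_a
df
```

Both vectors flag rows 2 and 3, so either one can go straight into mutate().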


R tibbles not printing correctly in VS Code

I would like to use R tibbles in VS Code but am seeing odd character formatting in the tibble output.
Take the penguins dataset from the palmerpenguins package. The raw .csv looks like this:
"","species","island","bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g","sex","year"
"1","Adelie","Torgersen",39.1,18.7,181,3750,"male",2007
"2","Adelie","Torgersen",39.5,17.4,186,3800,"female",2007
"3","Adelie","Torgersen",40.3,18,195,3250,"female",2007
"4","Adelie","Torgersen",NA,NA,NA,NA,NA,2007
"5","Adelie","Torgersen",36.7,19.3,193,3450,"female",2007
"6","Adelie","Torgersen",39.3,20.6,190,3650,"male",2007
When using R with VS Code, the output looks like this:
library(palmerpenguins)
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA NA 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
This issue is only present on my work computer. My personal computer prints the tibble in VS Code with the correct formatting.
I suspect the issue revolves around character encoding, but I'm not sure which setting needs to be changed. My encoding settings in VS Code are shown below. Any guidance on which settings would need to be changed is much appreciated.

How to tidy a deeply nested json-File in R? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 10 months ago.
There is a json-File I would like to analyze:
https://dam-api.bfs.admin.ch/hub/api/dam/assets/21364129/master
It has data about four different referendums (all from the same date), all the results for each region (kanton) and on community-level (gemeinde). I tried to make a tidy df out of this, but I don't even understand the exact syntax of the unnest-command (cols?).
After a lot of trial and error, I can get the data for a specific kanton, for example like this:
flatten(df$schweiz$vorlagen$kantone[[4]]$gemeinden[[13]])
But there must be some way to get this data for all 26 cantons, and of course with the information about which of the four referendums the data belongs to. I guess that some tools from the purrr package might be helpful, but all the tutorials I have read so far have been a mystery to me.
So, in short: Is there a way to get this file tidy without hours of manual work?
Converting such a large and complex structure into a data frame requires specific knowledge of what you want your final output to be, and you really need to tailor your code to do it. In this case, I assume you want a list containing 4 data frames (one for each referendum), where each row gives the kantone, the gemeinden, and the various fields indicating the result from each gemeinden. This is complex, but something like the following should work:
url <- "https://dam-api.bfs.admin.ch/hub/api/dam/assets/21364129/master"
json <- jsonlite::read_json(url)
result <- lapply(json$schweiz$vorlagen, function(x) {
  dplyr::as_tibble(do.call(rbind, lapply(x$kantone, function(y) {
    k <- y$geoLevelname
    do.call(rbind, lapply(y$gemeinden, function(z) {
      cbind(data.frame(kantone = k, gemeinden = z$geoLevelname),
            as.data.frame(z$resultat))
    }))
  })))
})
So your result looks like this:
result
#> [[1]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 25.4 194
#> 2 Zürich Affoltern am Albis TRUE 20.4 661
#> 3 Zürich Bonstetten TRUE 17.9 345
#> 4 Zürich Hausen am Albis TRUE 20.9 286
#> 5 Zürich Hedingen TRUE 19.3 269
#> 6 Zürich Kappel am Albis TRUE 24.1 93
#> 7 Zürich Knonau TRUE 21.6 161
#> 8 Zürich Maschwanden TRUE 21.1 49
#> 9 Zürich Mettmenstetten TRUE 19.3 368
#> 10 Zürich Obfelden TRUE 21.5 325
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
#>
#> [[2]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 60.3 470
#> 2 Zürich Affoltern am Albis TRUE 61.2 2014
#> 3 Zürich Bonstetten TRUE 59.8 1166
#> 4 Zürich Hausen am Albis TRUE 57.7 796
#> 5 Zürich Hedingen TRUE 58.5 826
#> 6 Zürich Kappel am Albis TRUE 49.6 192
#> 7 Zürich Knonau TRUE 56.8 431
#> 8 Zürich Maschwanden TRUE 52.6 123
#> 9 Zürich Mettmenstetten TRUE 58.3 1119
#> 10 Zürich Obfelden TRUE 57.2 868
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
#>
#> [[3]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 39.6 301
#> 2 Zürich Affoltern am Albis TRUE 35.7 1150
#> 3 Zürich Bonstetten TRUE 36.2 697
#> 4 Zürich Hausen am Albis TRUE 39.5 531
#> 5 Zürich Hedingen TRUE 37.1 509
#> 6 Zürich Kappel am Albis TRUE 42.9 162
#> 7 Zürich Knonau TRUE 36.7 273
#> 8 Zürich Maschwanden TRUE 37.1 86
#> 9 Zürich Mettmenstetten TRUE 39.2 742
#> 10 Zürich Obfelden TRUE 34.4 513
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
#>
#> [[4]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 39.0 300
#> 2 Zürich Affoltern am Albis TRUE 40.8 1321
#> 3 Zürich Bonstetten TRUE 42.8 826
#> 4 Zürich Hausen am Albis TRUE 41.6 570
#> 5 Zürich Hedingen TRUE 44.1 615
#> 6 Zürich Kappel am Albis TRUE 35.5 136
#> 7 Zürich Knonau TRUE 38.6 290
#> 8 Zürich Maschwanden TRUE 42.7 100
#> 9 Zürich Mettmenstetten TRUE 39.6 757
#> 10 Zürich Obfelden TRUE 40.5 610
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
If you want them all in one big data frame, you can add a number or name column to indicate which referendum you are referring to, then bind all the data frames together.
Created on 2022-04-16 by the reprex package (v2.0.1)
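As a rough sketch of that last step, dplyr::bind_rows() can stack a list of data frames and record which list element each row came from via its .id argument. The example below uses a small made-up stand-in for result (the values and the names "vorlage_1" and "vorlage_2" are hypothetical, not taken from the JSON file).

```r
library(dplyr)

# Stand-in for `result`: a list of data frames, one per referendum.
# Values and list names are invented for illustration only.
result_toy <- list(
  vorlage_1 = data.frame(kantone = c("Zürich", "Bern"),
                         jaStimmenAbsolut = c(194L, 120L)),
  vorlage_2 = data.frame(kantone = c("Zürich", "Bern"),
                         jaStimmenAbsolut = c(470L, 310L))
)

# .id adds a column holding the list name each row came from.
combined <- bind_rows(result_toy, .id = "referendum")
combined
```

With the real result, which is an unnamed list, you would probably want to assign names(result) first with whatever labels you use for the four referendums, so that the new column is meaningful.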

How to show rows that only contain N/A values in R?

I am having trouble writing a formula in R that allows me to output only rows that contain "N/A". I am assuming filter_all would be involved, since this would be applied to all of the columns in the dataset, but please let me know!
filter_all is superseded. We can use filter with if_all instead:
library(dplyr)
df1 %>%
  filter(if_all(everything(), is.na))
If we are using the penguins dataset, not all columns have NAs
library(palmerpenguins)
data(penguins)
> colSums(is.na(penguins))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 0 2 2 2 2 11
year
0
That is, 'species', 'island', and 'year' have 0 NAs, so the if_all code above returns 0 rows: no single row is NA in every column. We may need if_any instead, which keeps rows that have an NA in at least one column:
penguins %>%
  filter(if_any(everything(), is.na))
# A tibble: 11 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen NA NA NA NA <NA> 2007
2 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
3 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
4 Adelie Torgersen 37.8 17.1 186 3300 <NA> 2007
5 Adelie Torgersen 37.8 17.3 180 3700 <NA> 2007
6 Adelie Dream 37.5 18.9 179 2975 <NA> 2007
7 Gentoo Biscoe 44.5 14.3 216 4100 <NA> 2007
8 Gentoo Biscoe 46.2 14.4 214 4650 <NA> 2008
9 Gentoo Biscoe 47.3 13.8 216 4725 <NA> 2009
10 Gentoo Biscoe 44.5 15.7 217 4875 <NA> 2009
11 Gentoo Biscoe NA NA NA NA <NA> 2009
Or, if we want to restrict the check to columns that contain at least one NA and return the rows where all of those columns are NA:
penguins %>%
  filter(if_all(where(~ any(is.na(.x))), is.na))
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen NA NA NA NA <NA> 2007
2 Gentoo Biscoe NA NA NA NA <NA> 2009
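For a quick side-by-side of the two helpers, here is a small sketch on a made-up data frame (columns a and b are hypothetical). Row 2 is NA everywhere, row 3 is NA in one column only.

```r
library(dplyr)

# Toy data: row 1 complete, row 2 all NA, row 3 partly NA.
df <- data.frame(
  a = c(1, NA, NA),
  b = c("x", NA, "z")
)

# if_all: keep rows where every selected column is NA (row 2 only).
all_na <- df %>% filter(if_all(everything(), is.na))

# if_any: keep rows where at least one selected column is NA (rows 2 and 3).
any_na <- df %>% filter(if_any(everything(), is.na))

nrow(all_na)
nrow(any_na)
```

Which one you need depends on whether "rows that contain N/A" means entirely NA or NA somewhere.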

Error: Can't convert a `tbl_df/tbl/data.frame` object to function

penguins %>%
  select(species,island,sex) %>%
  rename(island_new=island) %>%
  rename_with(penguins,toupper)
This is the code that is causing the error. Can someone solve the problem?
It's implied that the first argument of rename_with is what has been piped to it, so you don't need to pass penguins as the first argument. Passing it again puts the data frame in the .fn position, which is why rename_with complains that it can't convert a data frame to a function:
penguins %>%
  select(species,island,sex) %>%
  rename(island_new=island) %>%
  rename_with(toupper)
# A tibble: 344 x 3
SPECIES ISLAND_NEW SEX
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen NA
5 Adelie Torgersen female
6 Adelie Torgersen male
7 Adelie Torgersen female
8 Adelie Torgersen male
9 Adelie Torgersen NA
10 Adelie Torgersen NA
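The argument positions can be checked on a one-row toy data frame (made up here for illustration). This sketch shows the working piped call, the optional .cols argument, and that repeating the data frame does raise an error.

```r
library(dplyr)

# Hypothetical one-row data frame standing in for the selected columns.
df <- data.frame(species = "Adelie", island = "Torgersen", sex = "male")

# Piped form: df fills .data, toupper fills .fn.
upper <- df %>% rename_with(toupper)
names(upper)

# .cols restricts which columns are renamed.
partial <- df %>% rename_with(toupper, .cols = c(species, sex))
names(partial)

# Repeating the data frame shifts it into the .fn slot and fails.
res <- tryCatch(df %>% rename_with(df, toupper), error = function(e) e)
inherits(res, "error")
```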

Dynamically create and evaluate function in R

I am trying to dynamically create and evaluate a function from a string input and am hung up, again, on meta-programming/evaluation (https://adv-r.hadley.nz/metaprogramming.html). I have a feeling this is answered on SO, but I searched and wasn't able to figure out the solution looking through other posts; however, if there is an existing answer, please let me know and flag as duplicate. Thank you so much for your time and help! Below is a reprex of the issue.
library(dplyr)
library(purrr)
library(rlang)
library(palmerpenguins)
# Create data to join with penguins
penguin_colors <-
  tibble(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    color = c("orange", "purple", "green")
  )
# Create function to do specified join and print join type
foo <- function(JOINTYPE) {
  # DOESN'T RUN
  # JOINTYPE_join(penguins, penguin_colors, by = "species")
  # call2(sym(paste0(JOINTYPE, "_join")), x = penguins, y = penguin_colors, by = "species")
  print(JOINTYPE)
}
# Desired behavior of foo when JOINTYPE == "inner"
inner_join(penguins, penguin_colors, by = "species")
#> # A tibble: 344 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Adelie Torge… 36.7 19.3 193 3450
#> 6 Adelie Torge… 39.3 20.6 190 3650
#> 7 Adelie Torge… 38.9 17.8 181 3625
#> 8 Adelie Torge… 39.2 19.6 195 4675
#> 9 Adelie Torge… 34.1 18.1 193 3475
#> 10 Adelie Torge… 42 20.2 190 4250
#> # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
#> # color <chr>
print("inner")
#> [1] "inner"
# Use function in for loop
for (JOINTYPE in c("inner", "left", "right")) {
  foo(JOINTYPE)
}
#> [1] "inner"
#> [1] "left"
#> [1] "right"
# Use function in vectorised fashion
walk(c("inner", "left", "right"), foo)
#> [1] "inner"
#> [1] "left"
#> [1] "right"
Created on 2020-10-27 by the reprex package (v0.3.0)
One option is to use get() to retrieve the appropriate function:
join <- function(JOINTYPE) {
  get( paste0(JOINTYPE, "_join") )
}
join("inner")(penguins, penguin_colors, by="species")
If using rlang, the more appropriate function here is rlang::exec:
join2 <- function(JOINTYPE, ...) {
  rlang::exec( paste0(JOINTYPE, "_join"), ... )
}
join2("inner", penguins, penguin_colors, by="species")
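If the set of join types is known up front, another possibility (not from the answer above, just a sketch) is to skip name lookup entirely and pick the function from a named list. The helper name join_by_type and the toy data frames x and y are made up for this example; the benefit is that a typo in the join type fails with a clear message instead of an object-not-found error.

```r
library(dplyr)

# Named list mapping the allowed join types to the dplyr functions.
joins <- list(
  inner = inner_join,
  left  = left_join,
  right = right_join,
  full  = full_join
)

# Hypothetical helper: look up the join by name, then call it.
join_by_type <- function(type, x, y, ...) {
  if (!type %in% names(joins)) stop("Unknown join type: ", type)
  joins[[type]](x, y, ...)
}

# Toy data: "Unknown" has no colour, "Chinstrap" has no count.
x <- data.frame(species = c("Adelie", "Gentoo", "Unknown"), n = 1:3)
y <- data.frame(species = c("Adelie", "Gentoo", "Chinstrap"),
                color = c("orange", "green", "purple"))

# Row counts per join type.
rows <- sapply(c("inner", "left", "right", "full"),
               function(t) nrow(join_by_type(t, x, y, by = "species")))
rows
```

This works the same way inside a for loop or purrr::walk() as the foo() calls in the question.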
