I am having trouble writing a formula in R that outputs only the rows that contain "N/A". I am assuming filter_all would be involved, since this should apply to all of the columns in the dataset, but please let me know!
filter_all has been superseded. We can use filter with if_all instead
library(dplyr)
df1 %>%
filter(if_all(everything(), is.na))
If we are using the penguins dataset, not all columns have NAs
library(palmerpenguins)
data(penguins)
> colSums(is.na(penguins))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 0 2 2 2 2 11
year
0
i.e. 'species', 'island' and 'year' have 0 NAs, so the above code with if_all returns 0 rows, because no single row has NA in every column. We may need if_any instead
penguins %>%
filter(if_any(everything(), is.na))
# A tibble: 11 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen NA NA NA NA <NA> 2007
2 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
3 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
4 Adelie Torgersen 37.8 17.1 186 3300 <NA> 2007
5 Adelie Torgersen 37.8 17.3 180 3700 <NA> 2007
6 Adelie Dream 37.5 18.9 179 2975 <NA> 2007
7 Gentoo Biscoe 44.5 14.3 216 4100 <NA> 2007
8 Gentoo Biscoe 46.2 14.4 214 4650 <NA> 2008
9 Gentoo Biscoe 47.3 13.8 216 4725 <NA> 2009
10 Gentoo Biscoe 44.5 15.7 217 4875 <NA> 2009
11 Gentoo Biscoe NA NA NA NA <NA> 2009
Or, if we want to look only at the columns that have at least one NA and return the rows where all of those columns are NA
penguins %>%
filter(if_all(where(~ any(is.na(.x))), is.na))
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen NA NA NA NA <NA> 2007
2 Gentoo Biscoe NA NA NA NA <NA> 2009
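The question mentions the text "N/A" rather than a true missing value, and is.na() does not match that string. Here is a minimal sketch, assuming the values were read in as character strings; the toy data frame and its column names are hypothetical. It converts "N/A" to NA with na_if and then contrasts if_any with if_all:

```r
library(dplyr)

# Hypothetical toy data: "N/A" stored as text, plus one genuine NA
df1 <- tibble(
  id    = c("a", "b", "c"),
  score = c("1", "N/A", "N/A"),
  note  = c("ok", "fine", NA)
)

# Turn the literal string "N/A" into a real NA in every character column
cleaned <- df1 %>%
  mutate(across(where(is.character), ~ na_if(.x, "N/A")))

# Rows with an NA in at least one column
any_na <- cleaned %>%
  filter(if_any(everything(), is.na))

# Rows with an NA in every column other than id
all_na <- cleaned %>%
  filter(if_all(-id, is.na))
```

If the data already contains real NA values, the na_if step can be dropped.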
I would like to use R tibbles in VS code but am seeing odd character formatting in tibble-output.
Take the penguins dataset from the palmerpenguins package. The raw .csv looks like this:
"","species","island","bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g","sex","year"
"1","Adelie","Torgersen",39.1,18.7,181,3750,"male",2007
"2","Adelie","Torgersen",39.5,17.4,186,3800,"female",2007
"3","Adelie","Torgersen",40.3,18,195,3250,"female",2007
"4","Adelie","Torgersen",NA,NA,NA,NA,NA,2007
"5","Adelie","Torgersen",36.7,19.3,193,3450,"female",2007
"6","Adelie","Torgersen",39.3,20.6,190,3650,"male",2007
When using R with VS Code, the output looks like this:
library(palmerpenguins)
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA NA 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
This issue is only present on my work computer. My personal computer prints the tibble in VS code with the correct formatting.
I suspect the issue revolves around character encoding but I'm not sure what setting needs to be changed. My encoding settings in VS code are shown below. Any guidance on what features would need to be changed is much appreciated.
I have a data frame such as below:
id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54
I need to fill the missing values in the columns from 'PP' to 'nh' using the values from the other rows with the same 'id'. My target data frame is as below:
id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA 1 0 0 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
I would appreciate it if anybody could share their comments with me.
Best Regards
Another option using na.locf:
df <- read.table(text="id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54", header=TRUE)
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
summarise(across(everything(), ~na.locf(., na.rm = FALSE, fromLast = FALSE)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 5 × 10
#> # Groups: id [2]
#> id Date Age Sex PP Duration cd nh W_B R_B
#> <int> <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
#> 1 583 99/07/19 51 2 NA 1 0 0 6.2 4.26
#> 2 583 99/07/23 51 2 NA 1 0 0 7 4.35
#> 3 3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
#> 4 3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
#> 5 3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
Created on 2022-07-02 by the reprex package (v2.0.1)
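A note on newer dplyr versions: from dplyr 1.1.0 onward, summarise() warns when it returns more than one row per group. A minimal sketch of the same idea with mutate(), which keeps one output row per input row, is below. It assumes the same df as above; the name filled is only for illustration.

```r
library(dplyr)
library(zoo)

df <- read.table(text = "id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54", header = TRUE)

# Carry the last observed value forward within each id;
# na.rm = FALSE keeps leading NAs so the column length is unchanged
filled <- df %>%
  group_by(id) %>%
  mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE))) %>%
  ungroup()
```

Columns that are entirely NA within an id (such as PP for id 583) stay NA, since there is nothing to carry forward.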
library(tidyverse)
df <- read_table("id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54")
df %>%
group_by(id) %>%
fill(PP:nh, .direction = 'updown')
#> # A tibble: 5 × 10
#> # Groups: id [2]
#> id Date Age Sex PP Duration cd nh W_B R_B
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 583 99/07/19 51 2 NA 1 0 0 6.2 4.26
#> 2 583 99/07/23 51 2 NA 1 NA 0 7 4.35
#> 3 3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
#> 4 3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
#> 5 3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
Created on 2022-07-02 by the reprex package (v2.0.1)
I am struggling with the tidyverse package. I'm using the mpg dataset from R to display the issue that I'm facing (ignore if the relationships are not relevant, it is just for the sake of explaining my problem).
What I'm trying to do is to obtain the average "displ" grouped by manufacturer and year AND at the same time (and this is what I can't figure out), have several columns for each of the fuel types variable (i.e.: a column for the mean of diesel, a column for the mean of petrol, etc.).
This is the first part of the code. I'm new to R, so I really don't know what I need to add...
mpg %>%
group_by(manufacturer, year) %>%
summarize(Mean. = mean(c(displ)))
# A tibble: 30 × 3
# Groups: manufacturer [15]
manufacturer year Mean.
<chr> <int> <dbl>
1 audi 1999 2.36
2 audi 2008 2.73
3 chevrolet 1999 4.97
4 chevrolet 2008 5.12
5 dodge 1999 4.32
6 dodge 2008 4.42
7 ford 1999 4.45
8 ford 2008 4.66
9 honda 1999 1.6
10 honda 2008 1.85
# … with 20 more rows
Any help is appreciated, thank you.
Perhaps, we need to reshape into 'wide'
library(dplyr)
library(tidyr)
mpg %>%
select(manufacturer, year, fl, displ) %>%
pivot_wider(names_from = fl, values_from = displ, values_fn = mean)
-output
# A tibble: 30 x 7
manufacturer year p r e d c
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 audi 1999 2.36 NA NA NA NA
2 audi 2008 2.73 NA NA NA NA
3 chevrolet 2008 6.47 4.49 5.3 NA NA
4 chevrolet 1999 5.7 4.22 NA 6.5 NA
5 dodge 1999 NA 4.32 NA NA NA
6 dodge 2008 NA 4.42 4.42 NA NA
7 ford 1999 NA 4.45 NA NA NA
8 ford 2008 5.4 4.58 NA NA NA
9 honda 1999 1.6 1.6 NA NA NA
10 honda 2008 2 1.8 NA NA 1.8
# … with 20 more rows
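An equivalent route, closer to the code in the question, is to add fl to the grouping, summarise, and only then reshape. This is a sketch, assuming mpg comes from ggplot2; the displ_ prefix and the name wide are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

wide <- ggplot2::mpg %>%
  # One mean per manufacturer, year and fuel type
  group_by(manufacturer, year, fl) %>%
  summarise(mean_displ = mean(displ), .groups = "drop") %>%
  # Spread the fuel types into their own columns
  pivot_wider(
    names_from   = fl,
    values_from  = mean_displ,
    names_prefix = "displ_"
  )
```

Combinations of manufacturer, year and fuel type that never occur in the data come out as NA, as in the output above.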
penguins %>%
select(species,island,sex) %>%
rename(island_new=island) %>%
rename_with(penguins,toupper)
This is the code that is causing an error. Can someone solve the problem?
It's implied that the first argument of rename_with is what has been piped to it, so you don't need to pass penguins as the first argument:
penguins %>%
select(species,island,sex) %>%
rename(island_new=island) %>%
rename_with(toupper)
# A tibble: 344 x 3
SPECIES ISLAND_NEW SEX
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen NA
5 Adelie Torgersen female
6 Adelie Torgersen male
7 Adelie Torgersen female
8 Adelie Torgersen male
9 Adelie Torgersen NA
10 Adelie Torgersen NA
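For context, the signature is rename_with(.data, .fn, .cols = everything(), ...). With the pipe, the incoming data fills .data, so an extra penguins lands in the .fn slot and dplyr tries to use a data frame as the renaming function. A small sketch on a hypothetical one-row tibble (a stand-in for penguins), also showing how .cols limits which columns are renamed:

```r
library(dplyr)

# Hypothetical stand-in for the selected penguins columns
df <- tibble(species = "Adelie", island = "Torgersen", sex = "male")

# Rename every column
all_upper <- df %>%
  rename_with(toupper)

# Rename only some columns via .cols
some_upper <- df %>%
  rename_with(toupper, .cols = c(species, sex))
```

Here names(some_upper) leaves island in lower case, because it was not listed in .cols.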
I often need to rescale time series relative to their value at a certain baseline time (usually as a percent of the baseline). Here's an example.
> library(dplyr)
> library(magrittr)
> library(tibble)
> library(tidyr)
# [messages from package imports snipped]
> set.seed(42)
> mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
> usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
> table <- rbind(mexico, usa)
> table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 11.4 19.9
2 2001 Mexico 10.4 22.5
3 2002 Mexico 12.4 21.9
4 2003 Mexico 13.6 25.0
5 2004 Mexico 14.4 23.9
6 2000 USA 31.3 40.6
7 2001 USA 33.3 40.7
8 2002 USA 30.6 39.3
9 2003 USA 32.7 40.6
10 2004 USA 33.9 45.3
I want to scale A and B to express each value as a percent of the country-specific 2001 value (i.e., the A and B entries in rows 2 and 7 should be 100). My way of doing this is somewhat roundabout and awkward: extract the baseline values into a separate table, merge them back into a separate column in the main table, and then compute scaled values, with annoying intermediate gathering and spreading to avoid specifying the column names of each time series (real data sets can have far more than two value columns). Is there a better way to do this, ideally with a single short pipeline?
> long_table <- table %>% gather(variable, value, -Year, -Country)
> long_table
# A tibble: 20 x 4
Year Country variable value
<int> <chr> <chr> <dbl>
1 2000 Mexico A 11.4
2 2001 Mexico A 10.4
#[remaining tibble printout snipped]
> baseline_table <- long_table %>%
filter(Year == 2001) %>%
select(-Year) %>%
rename(baseline=value)
> baseline_table
# A tibble: 4 x 3
Country variable baseline
<chr> <chr> <dbl>
1 Mexico A 10.4
2 USA A 33.3
3 Mexico B 22.5
4 USA B 40.7
> normalized_table <- long_table %>%
inner_join(baseline_table) %>%
mutate(value=100*value/baseline) %>%
select(-baseline) %>%
spread(variable, value) %>%
arrange(Country, Year)
Joining, by = c("Country", "variable")
> normalized_table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2001 Mexico 100. 100
3 2002 Mexico 118. 97.3
4 2003 Mexico 131. 111.
5 2004 Mexico 138. 106.
6 2000 USA 94.0 99.8
7 2001 USA 100 100
8 2002 USA 92.0 96.6
9 2003 USA 98.3 99.6
10 2004 USA 102. 111.
My second attempt was to use transform, but this failed because transform doesn't seem to recognize dplyr groups, and it would be suboptimal even if it worked because it requires me to know that 2001 is the second year in the time series.
> table %>%
arrange(Country, Year) %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
transform(norm=value*100/value[2])
Year Country variable value norm
1 2000 Mexico A 11.37096 108.9663
2 2001 Mexico A 10.43530 100.0000
3 2002 Mexico A 12.36313 118.4741
4 2003 Mexico A 13.63286 130.6418
5 2004 Mexico A 14.40427 138.0340
6 2000 USA A 31.30487 299.9901
7 2001 USA A 33.28665 318.9811
8 2002 USA A 30.61114 293.3422
9 2003 USA A 32.72121 313.5627
10 2004 USA A 33.86668 324.5395
11 2000 Mexico B 19.89388 190.6402
12 2001 Mexico B 22.51152 215.7247
13 2002 Mexico B 21.90534 209.9157
14 2003 Mexico B 25.01842 239.7480
15 2004 Mexico B 23.93729 229.3876
16 2000 USA B 40.63595 389.4085
17 2001 USA B 40.71575 390.1732
18 2002 USA B 39.34354 377.0235
19 2003 USA B 40.55953 388.6762
20 2004 USA B 45.32011 434.2961
It would be nice for this to be more scalable, but here's a simple solution. You can refer to A[Year == 2001] inside mutate, much as you might do table$A[table$Year == 2001] in base R. This lets you scale against your baseline of 2001 or whatever other year you might need.
Edit: I was missing a group_by to ensure that values are only being scaled against other values in their own group. The "sanity check" (that I clearly didn't do) is that values for Mexico in 2001 should have a scaled value of 1, and same for USA and any other countries.
library(tidyverse)
set.seed(42)
mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
table <- rbind(mexico, usa)
table %>%
group_by(Country) %>%
mutate(A_base2001 = A / A[Year == 2001], B_base2001 = B / B[Year == 2001])
#> # A tibble: 10 x 6
#> # Groups: Country [2]
#> Year Country A B A_base2001 B_base2001
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 Mexico 11.4 19.9 1.09 0.884
#> 2 2001 Mexico 10.4 22.5 1 1
#> 3 2002 Mexico 12.4 21.9 1.18 0.973
#> 4 2003 Mexico 13.6 25.0 1.31 1.11
#> 5 2004 Mexico 14.4 23.9 1.38 1.06
#> 6 2000 USA 31.3 40.6 0.940 0.998
#> 7 2001 USA 33.3 40.7 1 1
#> 8 2002 USA 30.6 39.3 0.920 0.966
#> 9 2003 USA 32.7 40.6 0.983 0.996
#> 10 2004 USA 33.9 45.3 1.02 1.11
Created on 2018-05-23 by the reprex package (v0.2.0).
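On the scalability point: with dplyr 1.0.0 or later, across() can apply the same expression to every value column, so the columns need not be typed out one by one. The following is a sketch under that assumption. The deterministic toy table tbl is hypothetical and stands in for the random data above.

```r
library(dplyr)

# Hypothetical deterministic stand-in for the random table
tbl <- tibble(
  Year    = rep(2000:2002, times = 2),
  Country = rep(c("Mexico", "USA"), each = 3),
  A       = c(10, 20, 30, 40, 50, 60),
  B       = c(5, 10, 20, 8, 4, 2)
)

scaled <- tbl %>%
  group_by(Country) %>%
  # Every non-grouping column except Year is divided by its own 2001 value
  mutate(across(-Year, ~ .x / .x[Year == 2001], .names = "{.col}_base2001")) %>%
  ungroup()
```

The same sanity check applies: every *_base2001 value in a 2001 row should be exactly 1.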
Inspired by Camille's answer, I found one simple approach that scales well:
table %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(value=100*value/value[Year == 2001]) %>%
spread(variable, value)
# A tibble: 10 x 4
# Groups: Country [2]
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2000 USA 94.0 99.8
3 2001 Mexico 100. 100
4 2001 USA 100 100
5 2002 Mexico 118. 97.3
6 2002 USA 92.0 96.6
7 2003 Mexico 131. 111.
8 2003 USA 98.3 99.6
9 2004 Mexico 138. 106.
10 2004 USA 102. 111.
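In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same pipeline with the newer verbs follows, assuming tidyr 1.0.0 or later. The toy table tbl is hypothetical and deterministic so the result is easy to check by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical deterministic stand-in for the random table
tbl <- tibble(
  Year    = rep(2000:2002, times = 2),
  Country = rep(c("Mexico", "USA"), each = 3),
  A       = c(10, 20, 30, 40, 50, 60),
  B       = c(5, 10, 20, 8, 4, 2)
)

pct <- tbl %>%
  # Long format: one row per Year, Country and series
  pivot_longer(-c(Year, Country), names_to = "variable", values_to = "value") %>%
  group_by(Country, variable) %>%
  # Percent of the 2001 value within each country and series
  mutate(value = 100 * value / value[Year == 2001]) %>%
  ungroup() %>%
  # Back to one column per series
  pivot_wider(names_from = variable, values_from = value) %>%
  arrange(Country, Year)
```

With this toy data, Mexico's A column becomes 50, 100, 150.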
Preserving the original values alongside the scaled ones takes more work. Here are two approaches. The first uses an extra gather call to produce two variable-name columns (one indicating the series name, the other marking original or scaled), then unites them into one column and reformats.
table %>%
gather(variable, original, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(scaled=100*original/original[Year == 2001]) %>%
gather(scaled, value, -Year, -Country, -variable) %>%
unite(variable_scaled, variable, scaled, sep='_') %>%
mutate(variable_scaled=gsub("_original", "", variable_scaled)) %>%
spread(variable_scaled, value)
# A tibble: 10 x 6
# Groups: Country [2]
Year Country A A_scaled B B_scaled
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 11.4 109. 19.9 88.4
2 2000 USA 31.3 94.0 40.6 99.8
3 2001 Mexico 10.4 100. 22.5 100
4 2001 USA 33.3 100 40.7 100
5 2002 Mexico 12.4 118. 21.9 97.3
6 2002 USA 30.6 92.0 39.3 96.6
7 2003 Mexico 13.6 131. 25.0 111.
8 2003 USA 32.7 98.3 40.6 99.6
9 2004 Mexico 14.4 138. 23.9 106.
10 2004 USA 33.9 102. 45.3 111.
A second, equivalent approach creates a new table with the columns scaled "in place" and then merges it back with the original one.
table %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
mutate(value=100*value/value[Year == 2001]) %>%
ungroup() %>%
mutate(variable=paste(variable, 'scaled', sep='_')) %>%
spread(variable, value) %>%
inner_join(table)
Joining, by = c("Year", "Country")
# A tibble: 10 x 6
Year Country A_scaled B_scaled A B
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 109. 88.4 11.4 19.9
2 2000 USA 94.0 99.8 31.3 40.6
3 2001 Mexico 100. 100 10.4 22.5
4 2001 USA 100 100 33.3 40.7
5 2002 Mexico 118. 97.3 12.4 21.9
6 2002 USA 92.0 96.6 30.6 39.3
7 2003 Mexico 131. 111. 13.6 25.0
8 2003 USA 98.3 99.6 32.7 40.6
9 2004 Mexico 138. 106. 14.4 23.9
10 2004 USA 102. 111. 33.9 45.3
It's possible to replace the final inner_join with arrange(Country, Year) %>% select(-Country, -Year) %>% bind_cols(table), which may perform better for some data sets, though it orders the columns suboptimally.