Replace all NAs with -1 in r with dplyr - r

I'm currently working with the tidyverse in R. After using mice to impute NAs some of the columns still have NAs due to the fact that they are poorly populated to begin with (I believe). As a final check I want to replace all of the remaining NAs with -1. It usually just happens in a single column depending on the dataset. Long story short I'm doing the same process on multiple locations and sometimes Col1 is populated wonderfully in region A, but badly in region B.
Currently I'm doing the following.
Clean.df <- df %>% mutate(
coalesce(Col1 ,-1),
coalesce(Col2, -1),
....)
And I'm doing that for 31 columns, which makes me think there must be an easier way. I read the coalesce documentation and tried passing the name of the data frame instead of a column, with no luck.
Thanks for the insight.

Since you didn't provide any data, I am using a sample data frame to show how every NA in a data frame can be replaced with a given value (-1):
library(tidyverse)
# creating example dataset
example_df <- ggplot2::msleep
# looking at NAs
example_df
#> # A tibble: 83 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Chee~ Acin~ carni Carn~ lc 12.1 NA NA
#> 2 Owl ~ Aotus omni Prim~ <NA> 17 1.8 NA
#> 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 NA
#> 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133
#> 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667
#> 6 Thre~ Brad~ herbi Pilo~ <NA> 14.4 2.2 0.767
#> 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383
#> 8 Vesp~ Calo~ <NA> Rode~ <NA> 7 NA NA
#> 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333
#> 10 Roe ~ Capr~ herbi Arti~ lc 3 NA NA
#> # ... with 73 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#> # bodywt <dbl>
# replacing NAs with -1
purrr::map_dfr(.x = example_df,
.f = ~ tidyr::replace_na(data = ., -1))
#> # A tibble: 83 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Chee~ Acin~ carni Carn~ lc 12.1 -1 -1
#> 2 Owl ~ Aotus omni Prim~ -1 17 1.8 -1
#> 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 -1
#> 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133
#> 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667
#> 6 Thre~ Brad~ herbi Pilo~ -1 14.4 2.2 0.767
#> 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383
#> 8 Vesp~ Calo~ -1 Rode~ -1 7 -1 -1
#> 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333
#> 10 Roe ~ Capr~ herbi Arti~ lc 3 -1 -1
#> # ... with 73 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#> # bodywt <dbl>
Created on 2018-10-10 by the reprex package (v0.2.1)
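The output above comes from 2018 package versions, and newer releases may be stricter about putting a number into a character column. A minimal sketch of the same idea with dplyr 1.0 or later, which provides across(), limited to numeric columns so that no column changes type. The tibble and its column names are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the imputed data
df <- tibble(
  Col1  = c(1, NA, 3),
  Col2  = c(NA, 5, 6),
  label = c("a", NA, "c")
)

# Replace NA with -1 in every numeric column; other columns are left alone
clean_df <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, -1)))

print(clean_df)
```

Character columns such as label keep their NA here; they would need a character replacement like "-1" if they should be filled as well.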

An alternative to Indrajeet's answer that is pure dplyr. Using Indrajeet's recommendation of ggplot2::msleep:
library(dplyr)
ggplot2::msleep %>%
mutate_at(vars(sleep_rem, sleep_cycle), ~ if_else(is.na(.), -1, .))
# # A tibble: 83 x 11
# name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
# <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Chee~ Acin~ carni Carn~ lc 12.1 -1 -1 11.9
# 2 Owl ~ Aotus omni Prim~ <NA> 17 1.8 -1 7
# 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 -1 9.6
# 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1
# 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 20
# 6 Thre~ Brad~ herbi Pilo~ <NA> 14.4 2.2 0.767 9.6
# 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383 15.3
# 8 Vesp~ Calo~ <NA> Rode~ <NA> 7 -1 -1 17
# 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333 13.9
# 10 Roe ~ Capr~ herbi Arti~ lc 3 -1 -1 21
# # ... with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>
If you want the nuclear option over all columns (numeric and character), then use:
ggplot2::msleep %>%
mutate_all(~ ifelse(is.na(.), -1, .))
# # A tibble: 83 x 11
# name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
# <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Chee~ Acin~ carni Carn~ lc 12.1 -1 -1 11.9
# 2 Owl ~ Aotus omni Prim~ -1 17 1.8 -1 7
# 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 -1 9.6
# 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1
# 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 20
# 6 Thre~ Brad~ herbi Pilo~ -1 14.4 2.2 0.767 9.6
# 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383 15.3
# 8 Vesp~ Calo~ -1 Rode~ -1 7 -1 -1 17
# 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333 13.9
# 10 Roe ~ Capr~ herbi Arti~ lc 3 -1 -1 21
# # ... with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>
Note that I'm no longer using dplyr::if_else, since the function needs to be versatile with (or ignorant of) the different types. Since base::ifelse will happily/silently(/sloppily?) convert, we're good.
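To make the difference concrete, here is a small sketch on a made-up character vector (x_chr is not part of the original data). It shows base::ifelse() turning the number into a string, while dplyr::if_else() raises an error for the same mix of types:

```r
library(dplyr)

x_chr <- c("a", NA, "c")

# base::ifelse() coerces the pieces to a common type, so -1 becomes "-1"
base_result <- ifelse(is.na(x_chr), -1, x_chr)
print(base_result)

# dplyr::if_else() refuses to combine a number with a character vector
strict_failed <- tryCatch(
  {
    if_else(is.na(x_chr), -1, x_chr)
    FALSE
  },
  error = function(e) TRUE
)
print(strict_failed)
```

The silent coercion is what makes mutate_all() work across mixed column types, at the cost of character columns now holding the string "-1".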

Related

Replace values in a table in R

I have this dataset
Longitude Latitude Radius Site_Type
<dbl> <dbl> <dbl> <chr>
1 -102. 1.5 5 OBS
2 -80.0 27.1 5 OBS
3 -158. 21.5 1 FEE;OBS
4 -81.6 3.98 1 FEE;OBS;NA
5 -87.0 5.50 1 OBS
6 -90.7 -0.55 1 FEE;OBS
7 -110. 24.7 1 FEE;OBS;NA
8 -89.5 28.4 1 OBS
9 -91.8 1.38 1 FEE;OBS
I want to replace NA with OBS. I tried using replace() but nothing changed...
NA is a character string here, not a missing value, so str_replace might work for you:
library(tidyverse)
df1 %>%
mutate(Site_Type = str_replace(Site_Type, "NA", "OBS"))
# Longitude Latitude Radius Site_Type
# 1 -102.0 1.50 5 OBS
# 2 -80.0 27.10 5 OBS
# 3 -158.0 21.50 1 FEE;OBS
# 4 -81.6 3.98 1 FEE;OBS;OBS
# 5 -87.0 5.50 1 OBS
# 6 -90.7 -0.55 1 FEE;OBS
# 7 -110.0 24.70 1 FEE;OBS;OBS
# 8 -89.5 28.40 1 OBS
# 9 -91.8 1.38 1 FEE;OBS
We can use sub in base R
df1$Site_Type <- sub("NA", "OBS", df1$Site_Type)
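Both str_replace() and sub() change only the first match in each string, and the pattern "NA" also matches inside longer tokens. A hedged sketch that replaces every standalone NA token instead, using str_replace_all() with word boundaries; the site_type vector is invented to show the edge cases:

```r
library(stringr)

# Made-up values: repeated NA tokens and a token that merely contains "NA"
site_type <- c("OBS", "FEE;OBS;NA", "NA;FEE;NA", "NAV;OBS")

# \\b marks a word boundary, so only a standalone NA token is replaced
fixed <- str_replace_all(site_type, "\\bNA\\b", "OBS")
print(fixed)
```

In base R, gsub("\\bNA\\b", "OBS", site_type) follows the same idea.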

Sort a dataframe according to characters in R [duplicate]

This question already has answers here:
R Sort strings according to substring
(2 answers)
Closed 2 years ago.
I have the data frame (code) and I want to sort it by combName in numerical order.
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14.0
# ... with 1,098 more rows
Afterwards the left column should look like:
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-2 ...
3 20-3 ...
4 20-4 ...
5 20-5 ...
...
10 20-10 14.7
# ... with 1,098 more rows
I do not know how to get it into this format.
Does this work:
library(dplyr)
library(tidyr)
df
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-102 25.4
2 20-100 21.2
3 20-101 17.6
4 20-105 24.3
5 20-10 14.7
6 20-103 46.3
7 20-104 68.7
8 20-1 8.05
9 20-106 46.3
10 20-107 14
df %>% separate(combName, into = c('1','2'), sep = '-', remove = FALSE) %>%
type.convert(as.is = TRUE) %>% arrange(`1`,`2`) %>% select(-c(`1`,`2`))
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14
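The ten-row sample above contains no names such as 20-2, so its numeric order happens to look the same as the alphabetical one. A base R sketch of the same split-and-sort idea on a few made-up rows where the two orders differ:

```r
# Hypothetical rows; only the combName pattern follows the question
code <- data.frame(
  combName  = c("20-102", "20-10", "20-2", "20-1", "3-5"),
  sumLength = c(25.4, 14.7, 9.9, 8.05, 1.2)
)

# Split on the hyphen and convert both parts to integers
parts  <- strsplit(code$combName, "-", fixed = TRUE)
prefix <- as.integer(vapply(parts, `[`, character(1), 1))
suffix <- as.integer(vapply(parts, `[`, character(1), 2))

# Order by the number before the hyphen, then by the number after it
sorted <- code[order(prefix, suffix), ]
print(sorted)
```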

Summarizing using function requiring multiple parameters in R

I'm trying to get the area under the curve of some data for each run of a set of simulation runs. My data is of the form:
run year data1 data2 data3
--- ---- ----- ----- -----
1 2001 2.3 45.6 30.2
1 2002 2.4 35.4 23.4
1 2003 2.6 45.6 23.6
2 2001 2.3 45.6 30.2
2 2002 2.4 35.4 23.4
2 2003 2.6 45.6 23.6
3 2001 ... and so on
So, I'd like to get the area under the curve for each data trace for run 1, run 2, ... where the x axis is always the year column and the y axis is each data column. So, as output I want something like:
run Data1_auc Data2_auc Data3_auc
--- --------- --------- ---------
1 4.5 6.7 27.5
2 3.4 6.8 35.4
3 4.5 7.8 45.6
(These are not actual areas for the data above)
I want to use the pracma package 'trapz' function to compute the area which takes x and y values: trapz(x, y) where x=year in my case and y=Data column.
I've tried
dataCols <- colnames(myData %>% select(-c("run","year"))
myData <- group_by(run) %>% summarize_at(vars(dataCols), list(auc = trapz(year,.)))
but I can't get it to work without error. I've tried different variations on this, but can't seem it get it right.
Is this possible? If so, how do I do it?
library(dplyr)
library(pracma)
set.seed(1)
df <- tibble(
run = rep(1:3, each = 3),
year = rep(2001:2003, 3),
data1 = runif(9, 2, 3),
data2 = runif(9, 30, 50),
data3 = runif(9, 20, 40)
)
df
#> # A tibble: 9 x 5
#> run year data1 data2 data3
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2001 2.27 31.2 27.6
#> 2 1 2002 2.37 34.1 35.5
#> 3 1 2003 2.57 33.5 38.7
#> 4 2 2001 2.91 43.7 24.2
#> 5 2 2002 2.20 37.7 33.0
#> 6 2 2003 2.90 45.4 22.5
#> 7 3 2001 2.94 40.0 25.3
#> 8 3 2002 2.66 44.4 27.7
#> 9 3 2003 2.63 49.8 20.3
df %>%
group_by(run) %>%
summarise_at(vars(starts_with("data")), list(auc = ~trapz(year, .)))
#> # A tibble: 3 x 4
#> run data1_auc data2_auc data3_auc
#> <int> <dbl> <dbl> <dbl>
#> 1 1 4.79 66.5 68.7
#> 2 2 5.10 82.3 56.4
#> 3 3 5.45 89.2 50.5
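For reference, a sketch of the same summary written with across(), available in dplyr 1.0 and later. To keep it free of extra packages it uses a hand-written trapezoid rule, trapezoid_area(), which is a helper made up here to stand in for pracma::trapz(). The data repeat the three rows shown in the question for two runs:

```r
library(dplyr)

# Trapezoid rule: sum of interval width times the mean of the two end heights
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

my_data <- tibble(
  run   = rep(1:2, each = 3),
  year  = rep(2001:2003, 2),
  data1 = rep(c(2.3, 2.4, 2.6), 2),
  data2 = rep(c(45.6, 35.4, 45.6), 2),
  data3 = rep(c(30.2, 23.4, 23.6), 2)
)

# Within each run, year is the x axis and each data column is the y axis
auc <- my_data %>%
  group_by(run) %>%
  summarise(across(starts_with("data"),
                   ~ trapezoid_area(year, .x),
                   .names = "{.col}_auc"))

print(auc)
```

Inside the lambda, year refers to the current group's years, which is why the grouped call needs no extra bookkeeping.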

How to separate each column name of a matrix by the +

I have built a matrix whose column names are those of a subset of regressors that I want to insert into a regression model formula in R.
For example:
data$age is the response variable
X is the design matrix whose column names are, for example, data$education and data$wage.
The problem is that the column names of X are not fixed (i.e. I don't know in advance which ones they are), so I tried this:
best_model <- lm(data$age ~ paste(colnames(x[, GA@solution == 1]), sep = "+"))
But it doesn't work.
Rather than building the formula by hand, you can pipe (%>%) the data through dplyr::select() and let the . in the formula pick up whichever columns remain. (For this, convert your matrix to a data frame first.)
library(tidyverse)
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
#> # ... with 224 more rows
Select
dplyr::select() subsets columns.
mpg %>%
select(hwy, manufacturer, displ, cyl, cty) %>% # subsetting
lm(hwy ~ ., data = .)
#>
#> Call:
#> lm(formula = hwy ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) manufacturerchevrolet manufacturerdodge
#> 2.65526 -1.08632 -2.55442
#> manufacturerford manufacturerhonda manufacturerhyundai
#> -2.29897 -2.98863 -0.94980
#> manufacturerjeep manufacturerland rover manufacturerlincoln
#> -3.36654 -1.87179 -1.10739
#> manufacturermercury manufacturernissan manufacturerpontiac
#> -2.64828 -2.44447 0.75427
#> manufacturersubaru manufacturertoyota manufacturervolkswagen
#> -3.04204 -2.73963 -1.62987
#> displ cyl cty
#> -0.03763 0.06134 1.33805
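Note that -col.name excludes that column. In lm(hwy ~ ., data = .), the pipe supplies the data frame as data = ., and the . on the right-hand side of the formula stands for all remaining columns.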
Tidyselect
Lots of data sets group their columns using underscore.
nycflights13::flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
#> 7 2013 1 1 555 600 -5 913
#> 8 2013 1 1 557 600 -3 709
#> 9 2013 1 1 557 600 -3 838
#> 10 2013 1 1 558 600 -2 753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
For instance, both dep_delay and arr_delay are about delay time. Select helpers such as starts_with(), ends_with(), and contains() can handle columns grouped like this.
nycflights13::flights %>%
select(starts_with("sched"),
ends_with("delay"),
distance)
#> # A tibble: 336,776 x 5
#> sched_dep_time sched_arr_time dep_delay arr_delay distance
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 515 819 2 11 1400
#> 2 529 830 4 20 1416
#> 3 540 850 2 33 1089
#> 4 545 1022 -1 -18 1576
#> 5 600 837 -6 -25 762
#> 6 558 728 -4 12 719
#> 7 600 854 -5 19 1065
#> 8 600 723 -3 -14 229
#> 9 600 846 -3 -8 944
#> 10 600 745 -2 8 733
#> # ... with 336,766 more rows
After that, just %>% lm().
nycflights13::flights %>%
select(starts_with("sched"),
ends_with("delay"),
distance) %>%
lm(dep_delay ~ ., data = .)
#>
#> Call:
#> lm(formula = dep_delay ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) sched_dep_time sched_arr_time arr_delay
#> -0.151408 0.002737 0.000951 0.816684
#> distance
#> 0.001859
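If you would rather build the formula from the column names directly, as in the question, note that paste() needs collapse = "+" (not sep) to join a vector into one string, and that lm() needs a formula rather than a string. A minimal sketch with base R's reformulate(), using mtcars and a hard-coded predictors vector as stand-ins for your data and your selected column names:

```r
# Stand-in for the selected column names, e.g. colnames() of the chosen subset
predictors <- c("wt", "hp")

# Builds the formula mpg ~ wt + hp from the character vector
model_formula <- reformulate(predictors, response = "mpg")
print(model_formula)

fit <- lm(model_formula, data = mtcars)
print(coef(fit))
```

as.formula(paste("mpg ~", paste(predictors, collapse = "+"))) produces the same formula.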

Webscraping in R - commented out table [duplicate]

This question already has an answer here:
Not able to scrape a second table within a page using rvest
(1 answer)
Closed 4 years ago.
I'm trying to webscrape the final table in https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml
i.e. the "MLB Detailed Standings"
My R code is as follows:
library(XML)
library(httr)
library(plyr)
library(stringr)
url <- paste0("http://www.baseball-reference.com/leagues/MLB/", 2015, "-standings.shtml")
tab <- GET(url)
data <- readHTMLTable(rawToChar(tab$content))
However, it does not seem to pick up the table I want. Looking at the source code, it seems as though the table is commented out somehow?
Any help would be great
From the answer MrFlick linked:
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml")
alt_tables <- xml2::xml_find_all(page, "//comment()") %>% {
  # Find only commented nodes that contain the regex for html table markup
  raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--", "-->"), c("", ""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i) {
    rvest::html_table(xml2::xml_find_all(xml2::read_html(i), "//table")) %>%
      .[[1]]
  })
}
tbl <- alt_tables[[2]]
# as_tibble() replaces the deprecated as.tibble()
tbl <- as_tibble(tbl)
tbl
# A tibble: 31 x 23
Rk Tm Lg G W L `W-L%` R RA Rdiff SOS SRS pythWL Luck Inter Home Road ExInn
<int> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <chr> <chr> <chr> <chr>
1 1 STL NL 162 100 62 0.617 4 3.2 0.8 -0.3 0.5 96-66 4 11-9 55-26 45-36 8-8
2 2 PIT NL 162 98 64 0.605 4.3 3.7 0.6 -0.3 0.3 93-69 5 13-7 53-28 45-36 12-9
3 3 CHC NL 162 97 65 0.599 4.3 3.8 0.5 -0.3 0.2 90-72 7 10-10 49-32 48-33 13-5
4 4 KCR AL 162 95 67 0.586 4.5 4 0.5 0.2 0.7 90-72 5 13-7 51-30 44-37 10-6
5 5 TOR AL 162 93 69 0.574 5.5 4.1 1.4 0.2 1.6 102-60 -9 12-8 53-28 40-41 8-6
6 6 LAD NL 162 92 70 0.568 4.1 3.7 0.4 -0.3 0.1 89-73 3 10-10 55-26 37-44 6-9
7 7 NYM NL 162 90 72 0.556 4.2 3.8 0.4 -0.4 0 89-73 1 9-11 49-32 41-40 9-6
8 8 TEX AL 162 88 74 0.543 4.6 4.5 0.1 0.2 0.4 83-79 5 11-9 43-38 45-36 5-4
9 9 NYY AL 162 87 75 0.537 4.7 4.3 0.4 0.3 0.8 88-74 -1 11-9 45-36 42-39 4-9
10 10 HOU AL 162 86 76 0.531 4.5 3.8 0.7 0.2 0.9 93-69 -7 16-4 53-28 33-48 8-6
# ... with 21 more rows, and 5 more variables: `1Run` <chr>, vRHP <chr>, vLHP <chr>, `≥.500` <chr>, `<.500` <chr>
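The mechanism is easier to see on a small offline example. The sketch below uses a made-up HTML string in place of the real page: it collects the comment nodes, keeps those that contain table markup, and parses each one as its own HTML fragment:

```r
library(xml2)
library(rvest)

# Invented page: the only table lives inside an HTML comment
html <- paste0(
  "<html><body><div>",
  "<!-- <table><tr><th>Tm</th><th>W</th></tr>",
  "<tr><td>STL</td><td>100</td></tr></table> -->",
  "</div></body></html>"
)

page <- read_html(html)

# xml_text() on a comment node returns its content without the <!-- --> markers
comments <- xml_text(xml_find_all(page, "//comment()"))
table_comments <- comments[grepl("<table", comments, fixed = TRUE)]

# Parse each commented-out fragment and read its first table
tables <- lapply(table_comments, function(txt) {
  html_table(xml_find_all(read_html(txt), "//table"))[[1]]
})

print(tables[[1]])
```

Because xml_text() already drops the comment markers, no regex clean-up step is needed in this version.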
