Selecting rows with rowSums and mutate with dplyr error - r

data:
head(well_being_df2)
# A tibble: 6 x 70
Age Gender EmploymentStatus PWI1 PWI2 PWI3 PWI4 PWI5 PWI6 PWI7 Personality1 Personality2 Personality3
<dbl> <dbl+l> <dbl+lbl> <dbl+> <dbl+> <dbl+> <dbl+> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
I am selecting a subset of columns and trying to mutate them. I have played around with the solution provided here, but I am getting various errors. I am trying to select the PWI columns, then use mutate with rowSums to create a new variable called PWI_Index.
This works:
rowSums(select(well_being_df2, contains("PWI")))
[1] 50 32 48 32 58 52 41 51 49 37 50 53 58 47....
[38] 58 60 63 60 63 56 43 30 45 53 45 44 57 55....
[75] 50 55 57 58 57 58 58 58 62 62 44 59 58....
But then when I try to mutate:
mutate(well_being_df2, x = rowSums(select(well_being_df2,
contains("PWI"))))
This outputs the entire set of columns, not just the "PWI" columns. Example:
# A tibble: 169 x 71
Age Gender EmploymentStatus PWI1 PWI2 PWI3 PWI4 PWI5 PWI6 PWI7 Personality1 Personality2 Personality3
<dbl> <dbl+l> <dbl+lbl> <dbl+> <dbl+> <dbl+> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 22 2 3 8 8 6 8 8 6 6 1 1 1
2 20 2 1 4 6 1 8 8 4 1 4 5 4
It returns the entire data frame instead of only the rowSums of the "PWI" columns. Using [.4:10] doesn't work either. With any other solution I get the following error:
select(well_being_df2[.4:10]) %>%
mutate(PWI_Index = rowSums(.)) %>% left_join(well_being_df2)
Error: Column indexes must be integer, not 0.11, 1.11,...
Plus working through previous examples with:
well_being_df2 %>%
mutate(x = rowSums(select(., contains("PWI")))) %>%
head()
And it takes the entire set of columns like before.

I'm not sure I understand (or can reproduce) your issue.
Here is an example using the iris data that works just fine.
iris %>%
mutate(x = rowSums(select(., contains("Width")))) %>%
head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#1 5.1 3.5 1.4 0.2 setosa 3.7
#2 4.9 3.0 1.4 0.2 setosa 3.2
#3 4.7 3.2 1.3 0.2 setosa 3.4
#4 4.6 3.1 1.5 0.2 setosa 3.3
#5 5.0 3.6 1.4 0.2 setosa 3.8
#6 5.4 3.9 1.7 0.4 setosa 4.3
As you can see x is the sum of columns Sepal.Width and Petal.Width, and is the same as
rowSums(select(iris, contains("Width"))) %>% head()
#[1] 3.7 3.2 3.4 3.3 3.8 4.3
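One possible reading of the question is that mutate() is behaving as designed: it keeps every existing column and appends the new one, so the printed result is always the full data frame plus the sum. A minimal sketch of that behaviour, assuming a small invented data frame (toy, with made-up values) standing in for well_being_df2:

```r
library(dplyr)

# Hypothetical stand-in for well_being_df2; column names follow the
# question, the values are invented for illustration.
toy <- data.frame(
  Age = c(22, 20, 31),
  PWI1 = c(8, 4, 7),
  PWI2 = c(8, 6, 5),
  Personality1 = c(1, 4, 2)
)

# mutate() keeps all original columns and appends PWI_Index
with_index <- toy %>%
  mutate(PWI_Index = rowSums(select(., contains("PWI"))))

# To see only the PWI columns and the new index, select afterwards
pwi_only <- with_index %>%
  select(contains("PWI"))
```

Here with_index has all four original columns plus PWI_Index, while pwi_only is narrowed to PWI1, PWI2 and PWI_Index. If only the narrowed view is wanted, the select() step after mutate() is what produces it.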

Related

Conditionally replace all records for group_by if condition is met once dplyr ifelse

I am trying to replace all values in nat_locx with the value from the first row in LOCX if multiple conditions are met once or more for id (my group_by() variable).
Here is an example of my data:
id DATE nat_locx LOCX distance loc_age
<fct> <date> <dbl> <dbl> <dbl> <dbl>
6553 2004-06-27 13.5 2 487.90 26
6553 2004-07-14 13.5 13.5 0 43
6553 2004-07-15 13.5 12.5 30 44
6553 2004-07-25 13.5 14.5 44.598 54
6081 2004-07-05 13 14.2 40.249 44
6081 2004-07-20 13 13.8 61.847 49
The way I have tried to do this is like so:
df<-df %>%
group_by(id) %>%
mutate(nat_locx=ifelse(loc_age>25 & loc_age<40 & distance>30, first(LOCX), nat_locx))
However, when I do this, it only replaces the first row with the first value from the LOCX column instead of all the nat_locx values for my group_by variable (id).
Ideally, I'd like this output:
id DATE nat_locx LOCX distance loc_age
<fct> <date> <dbl> <dbl> <dbl> <dbl>
6553 2004-06-27 2 2 487.90 26
6553 2004-07-14 2 13.5 0 43
6553 2004-07-15 2 12.5 30 44
6553 2004-07-25 2 14.5 44.598 54
6081 2004-07-05 13 14.2 40.249 44
6081 2004-07-20 13 13.8 61.847 49
A dplyr solution is preferred.
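We could use a classic non-vectorized if/else statement. The condition is wrapped in any() so that it is a single TRUE/FALSE per group; if needs a length-one condition, and R 4.2.0 and later raise an error otherwise:
df %>%
  group_by(id) %>%
  mutate(nat_locx = if (any(loc_age > 25 &
                            loc_age < 40 &
                            distance > 30)) {
    first(LOCX)
  } else {
    nat_locx
  })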
id DATE nat_locx LOCX distance loc_age
<int> <chr> <dbl> <dbl> <dbl> <int>
1 6553 2004-06-27 2 2 488. 26
2 6553 2004-07-14 2 13.5 0 43
3 6553 2004-07-15 2 12.5 30 44
4 6553 2004-07-25 2 14.5 44.6 54
5 6081 2004-07-05 13 14.2 40.2 44
6 6081 2004-07-20 13 13.8 61.8 49
We may need replace
df %>%
group_by(id) %>%
mutate(nat_locx =
replace(nat_locx,
loc_age>25 & loc_age<40 & distance>30,
first(LOCX)))
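A note on why the original ifelse() attempt changed only one row: ifelse() and replace() evaluate the condition row by row, so only the rows where it is TRUE are altered. Collapsing the condition to one value per group with any() makes the decision at group level. A minimal sketch using the data from the question (DATE omitted; the helper column hit is a hypothetical name introduced here):

```r
library(dplyr)

# Data copied from the question, without the DATE column
df <- data.frame(
  id = c(6553, 6553, 6553, 6553, 6081, 6081),
  nat_locx = c(13.5, 13.5, 13.5, 13.5, 13, 13),
  LOCX = c(2, 13.5, 12.5, 14.5, 14.2, 13.8),
  distance = c(487.90, 0, 30, 44.598, 40.249, 61.847),
  loc_age = c(26, 43, 44, 54, 44, 49)
)

result <- df %>%
  group_by(id) %>%
  mutate(
    # TRUE for every row of a group if any row meets all conditions
    hit = any(loc_age > 25 & loc_age < 40 & distance > 30),
    # replace the whole group with the first LOCX when hit is TRUE
    nat_locx = if_else(hit, first(LOCX), nat_locx)
  ) %>%
  ungroup() %>%
  select(-hit)
```

With this data, every nat_locx in group 6553 becomes 2 and group 6081 stays at 13, which matches the desired output in the question.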

Webscraping in R - commented out table [duplicate]

This question already has an answer here:
Not able to scrape a second table within a page using rvest
I'm trying to webscrape the final table in https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml
i.e. the "MLB Detailed Standings"
My R code is as follows:
library(XML)
library(httr)
library(plyr)
library(stringr)
url <- paste0("http://www.baseball-reference.com/leagues/MLB/", 2015, "-standings.shtml")
tab <- GET(url)
data <- readHTMLTable(rawToChar(tab$content))
However, it does not seem to pick up the table I want. Looking at the source code, it seems as though the table is commented out somehow?
Any help would be great.
From the answer MrFlick linked:
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml")
alt_tables <- xml2::xml_find_all(page, "//comment()") %>% {
  # Find only commented nodes that contain the regex for html table markup
  raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--", "-->"), c("", ""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i) {
    # xml2:: is spelled out because recent rvest versions do not attach xml2
    rvest::html_table(xml2::xml_find_all(xml2::read_html(i), "//table")) %>%
      .[[1]]
  })
}
tbl <- alt_tables[[2]]
# as_tibble() replaces the deprecated as.tibble()
tbl <- as_tibble(tbl)
tbl
# A tibble: 31 x 23
Rk Tm Lg G W L `W-L%` R RA Rdiff SOS SRS pythWL Luck Inter Home Road ExInn
<int> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <chr> <chr> <chr> <chr>
1 1 STL NL 162 100 62 0.617 4 3.2 0.8 -0.3 0.5 96-66 4 11-9 55-26 45-36 8-8
2 2 PIT NL 162 98 64 0.605 4.3 3.7 0.6 -0.3 0.3 93-69 5 13-7 53-28 45-36 12-9
3 3 CHC NL 162 97 65 0.599 4.3 3.8 0.5 -0.3 0.2 90-72 7 10-10 49-32 48-33 13-5
4 4 KCR AL 162 95 67 0.586 4.5 4 0.5 0.2 0.7 90-72 5 13-7 51-30 44-37 10-6
5 5 TOR AL 162 93 69 0.574 5.5 4.1 1.4 0.2 1.6 102-60 -9 12-8 53-28 40-41 8-6
6 6 LAD NL 162 92 70 0.568 4.1 3.7 0.4 -0.3 0.1 89-73 3 10-10 55-26 37-44 6-9
7 7 NYM NL 162 90 72 0.556 4.2 3.8 0.4 -0.4 0 89-73 1 9-11 49-32 41-40 9-6
8 8 TEX AL 162 88 74 0.543 4.6 4.5 0.1 0.2 0.4 83-79 5 11-9 43-38 45-36 5-4
9 9 NYY AL 162 87 75 0.537 4.7 4.3 0.4 0.3 0.8 88-74 -1 11-9 45-36 42-39 4-9
10 10 HOU AL 162 86 76 0.531 4.5 3.8 0.7 0.2 0.9 93-69 -7 16-4 53-28 33-48 8-6
# ... with 21 more rows, and 5 more variables: `1Run` <chr>, vRHP <chr>, vLHP <chr>, `≥.500` <chr>, `<.500` <chr>
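The same idea can be tried offline. The sketch below is a reduced illustration, not the code from the linked answer: it assumes a tiny invented HTML string (html, with made-up rows) in which the table sits inside a comment, pulls the comment text with xml2, and re-parses it as HTML.

```r
library(xml2)
library(rvest)

# Invented page: the table is hidden inside an HTML comment
html <- '<html><body><div><!--
<table id="standings">
<tr><th>Tm</th><th>W</th></tr>
<tr><td>STL</td><td>100</td></tr>
<tr><td>PIT</td><td>98</td></tr>
</table>
--></div></body></html>'

page <- read_html(html)

# xml_text() on a comment node returns its content without the <!-- --> markers
comment_text <- xml_text(xml_find_all(page, "//comment()"))

# Keep only comments that contain table markup
with_tables <- comment_text[grepl("<table", comment_text, fixed = TRUE)]

# Re-parse each comment as HTML and convert its first table to a data frame
tables <- lapply(with_tables, function(txt) {
  html_table(xml_find_first(read_html(trimws(txt)), "//table"))
})
```

tables[[1]] then holds the two invented rows. On the real page the list would contain one element per commented-out table, so the wanted one still has to be picked by position or by inspecting the column names.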

Replace all NAs with -1 in r with dplyr

I'm currently working with the tidyverse in R. After using mice to impute NAs some of the columns still have NAs due to the fact that they are poorly populated to begin with (I believe). As a final check I want to replace all of the remaining NAs with -1. It usually just happens in a single column depending on the dataset. Long story short I'm doing the same process on multiple locations and sometimes Col1 is populated wonderfully in region A, but badly in region B.
Currently I'm doing the following.
Clean.df <- df %>% mutate(
coalesce(Col1 ,-1),
coalesce(Col2, -1),
....)
And I'm doing that for 31 columns, which makes me think there must be an easier way. I attempted to read the coalesce documentation and tried to replace the column with the name of the data frame, but had no luck.
Thanks for the insight.
Since you didn't provide any data, I am using a sample data frame to show how every NA in a data frame can be replaced with a given value (-1):
library(tidyverse)
# creating example dataset
example_df <- ggplot2::msleep
# looking at NAs
example_df
#> # A tibble: 83 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Chee~ Acin~ carni Carn~ lc 12.1 NA NA
#> 2 Owl ~ Aotus omni Prim~ <NA> 17 1.8 NA
#> 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 NA
#> 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133
#> 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667
#> 6 Thre~ Brad~ herbi Pilo~ <NA> 14.4 2.2 0.767
#> 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383
#> 8 Vesp~ Calo~ <NA> Rode~ <NA> 7 NA NA
#> 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333
#> 10 Roe ~ Capr~ herbi Arti~ lc 3 NA NA
#> # ... with 73 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#> # bodywt <dbl>
# replacing NAs with -1
purrr::map_dfr(.x = example_df,
.f = ~ tidyr::replace_na(data = ., -1))
#> # A tibble: 83 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Chee~ Acin~ carni Carn~ lc 12.1 -1 -1
#> 2 Owl ~ Aotus omni Prim~ -1 17 1.8 -1
#> 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 -1
#> 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133
#> 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667
#> 6 Thre~ Brad~ herbi Pilo~ -1 14.4 2.2 0.767
#> 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383
#> 8 Vesp~ Calo~ -1 Rode~ -1 7 -1 -1
#> 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333
#> 10 Roe ~ Capr~ herbi Arti~ lc 3 -1 -1
#> # ... with 73 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#> # bodywt <dbl>
Created on 2018-10-10 by the reprex package (v0.2.1)
An alternative to Indrajeet's answer that is pure dplyr. Using Indrajeet's recommendation of ggplot2::msleep:
library(dplyr)
ggplot2::msleep %>%
mutate_at(vars(sleep_rem, sleep_cycle), ~ if_else(is.na(.), -1, .))
# # A tibble: 83 x 11
# name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
# <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Chee~ Acin~ carni Carn~ lc 12.1 -1 -1 11.9
# 2 Owl ~ Aotus omni Prim~ <NA> 17 1.8 -1 7
# 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 -1 9.6
# 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1
# 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 20
# 6 Thre~ Brad~ herbi Pilo~ <NA> 14.4 2.2 0.767 9.6
# 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383 15.3
# 8 Vesp~ Calo~ <NA> Rode~ <NA> 7 -1 -1 17
# 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333 13.9
# 10 Roe ~ Capr~ herbi Arti~ lc 3 -1 -1 21
# # ... with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>
If you want the nuclear option over all columns (numeric and character), then use:
ggplot2::msleep %>%
mutate_all(~ ifelse(is.na(.), -1, .))
# # A tibble: 83 x 11
# name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
# <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Chee~ Acin~ carni Carn~ lc 12.1 -1 -1 11.9
# 2 Owl ~ Aotus omni Prim~ -1 17 1.8 -1 7
# 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 -1 9.6
# 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1
# 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 20
# 6 Thre~ Brad~ herbi Pilo~ -1 14.4 2.2 0.767 9.6
# 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383 15.3
# 8 Vesp~ Calo~ -1 Rode~ -1 7 -1 -1 17
# 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333 13.9
# 10 Roe ~ Capr~ herbi Arti~ lc 3 -1 -1 21
# # ... with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>
Note that I'm no longer using dplyr::if_else, since the function needs to be versatile with (or ignorant of) the different types. Since base::ifelse will happily/silently(/sloppily?) convert, we're good.
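For readers on a newer dplyr, a type-preserving variant is sketched below. It assumes dplyr 1.0.0 or later (for across() and where()) and uses an invented three-column data frame; it touches only numeric columns, so character columns keep their NA rather than being filled with a number.

```r
library(dplyr)

# Invented example data: two numeric columns and one character column
df <- data.frame(
  Col1 = c(1, NA, 3),
  Col2 = c(NA, 5, 6),
  Name = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# Replace NA with -1 in every numeric column; other columns are left alone
clean <- df %>%
  mutate(across(where(is.numeric), ~ coalesce(.x, -1)))
```

This replaces the 31 separate coalesce() calls from the question with a single one. Whether character columns should also be filled is a separate decision; if so, they would need a character replacement such as "-1".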

Subset Duplicated Values >10

I am looking at a data frame and trying to subset rows that have the same pressure value for more than 5 rows, or delete rows that do not have 5 duplicate pressure values...
File Turbidity Pressure
1 3.2 46
2 3.4 46
3 5.4 46
4 3.2 46
5 3.1 46
6 2.3 46
7 2.3 45
8 4.5 45
9 2.3 45
10 3.2 44
11 4.5 44
12 6.5 43
13 3.2 42
14 3.1 41
15 1.2 41
16 2.3 41
17 2.4 41
18 2.1 41
19 1.4 41
25 1.3 41
So basically I am trying to keep the rows that have a pressure of 46 and 41 and delete the rows in between. This is a small portion of my dataset, and I just need code that will keep rows with 5 or more duplicate pressure values and delete the others.
Try
library(dplyr)
df %>% group_by(Pressure) %>% filter(n() >= 5)
Which gives:
#Source: local data frame [13 x 3]
#Groups: Pressure
#
# File Turbidity Pressure
#1 1 3.2 46
#2 2 3.4 46
#3 3 5.4 46
#4 4 3.2 46
#5 5 3.1 46
#6 6 2.3 46
#7 14 3.1 41
#8 15 1.2 41
#9 16 2.3 41
#10 17 2.4 41
#11 18 2.1 41
#12 19 1.4 41
#13 25 1.3 41
Here's a data.table solution (relies crucially on Pressure not repeating itself later on):
library(data.table)
setDT(df)[,if(.N>=5) .SD,by=Pressure]
Addendum:
If you expect Pressure values to repeat later on, e.g.:
df<-data.frame(File=c(1:19,25:28),
Pressure=rep(c(46:41,46),c(6,3,2,1,1,7,3)))
Then you'll need to use rleid in order to keep only groups of at least 5 in a row (no gaps):
setDT(df)[,ct:=rleid(Pressure)][,if (.N>=5) .SD,by=ct]
Here is a solution using base R:
df <- data.frame(id=1:10, Pressure=c(rep(1,5),6:10))
p.counts <- table(df[,"Pressure"])
good.pressures <- as.numeric(names(p.counts))[p.counts>=5]
df.sub <- df[df[,"Pressure"]%in%good.pressures,]
Note that I'm using df as an example data set, so you can delete that first line of code and replace all instances of df with the name of your data.frame.
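The table()-based approach counts every occurrence of a value, wherever it appears. If only uninterrupted runs should count, base R's rle() can play the role that rleid plays in the data.table answer. A minimal sketch, reusing the example data from the addendum above (keep and df_runs are names introduced here):

```r
# Example data from the addendum: 46 appears again at the end
df <- data.frame(
  File = c(1:19, 25:28),
  Pressure = rep(c(46:41, 46), c(6, 3, 2, 1, 1, 7, 3))
)

# Run-length encode Pressure: one entry per uninterrupted run
runs <- rle(df$Pressure)

# Expand "run is at least 5 long" back to one logical value per row
keep <- rep(runs$lengths >= 5, runs$lengths)

df_runs <- df[keep, ]
```

Here the first run of 46 (6 rows) and the run of 41 (7 rows) are kept, while the trailing run of 46 (3 rows) is dropped. A plain count per value would have kept those 3 rows too, because 46 occurs 9 times in total.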

Why length function does not work correct in R?

The following R code gives the cars whose Type is Small, but the length function returns 6 instead of 13. Why is that?
> fuel.frame[fuel.frame$Type=="Small",]
row.names Weight Disp. Mileage Fuel Type
1 Eagle.Summit.4 30 0.97 33 3.030303 Small
2 Ford.Escort.4 28 114.00 33 3.030303 Small
3 Ford.Festiva.4 23 0.81 37 2.702703 Small
4 Honda.Civic.4 27 0.91 32 3.125000 Small
5 Mazda.Protege.4 29 113.00 32 3.125000 Small
6 Mercury.Tracer.4 27 0.97 26 3.846154 Small
7 Nissan.Sentra.4 27 0.97 33 3.030303 Small
8 Pontiac.LeMans.4 28 0.98 28 3.571429 Small
9 Subaru.Loyale.4 27 109.00 25 4.000000 Small
10 Subaru.Justy.3 24 0.73 34 2.941176 Small
11 Toyota.Corolla.4 28 0.97 29 3.448276 Small
12 Toyota.Tercel.4 25 0.89 35 2.857143 Small
13 Volkswagen.Jetta.4 28 109.00 26 3.846154 Small
> length(fuel.frame[fuel.frame$Type=="Small",])
[1] 6
length gives in this case the number of columns in the data frame. You can instead use nrow or ncol to get the number of rows or number of columns respectively:
nrow(fuel.frame[fuel.frame$Type=="Small",])
Another example using iris dataset:
> d = head(iris)
> d
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> nrow(d)
[1] 6
> ncol(d)
[1] 5
> dim(d)
[1] 6 5
I thought it might help to give a bit of an explanation as to why you're getting your result. You're asking for the length of the data.frame, not the vector. Since the data.frame has 6 columns, that explains your result.
This asks for the vector specifically:
length(fuel.frame$Type[fuel.frame$Type=="Small"])
and so does this:
length(fuel.frame[fuel.frame$Type=="Small",][,1])
or use nrow instead of length as already suggested.
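The difference can be checked on a small invented data frame (cars, with made-up values, standing in for fuel.frame). It has 4 columns and 3 matching rows, so the two counts cannot be confused:

```r
# Invented stand-in for fuel.frame: 4 columns, 4 rows
cars <- data.frame(
  Model = c("A", "B", "C", "D"),
  Weight = c(30, 45, 23, 27),
  Mileage = c(33, 20, 37, 32),
  Type = c("Small", "Large", "Small", "Small"),
  stringsAsFactors = FALSE
)

small <- cars[cars$Type == "Small", ]

n_cols <- length(small)                # a data frame is a list of columns
n_rows <- nrow(small)                  # number of matching rows
n_match <- sum(cars$Type == "Small")   # counts TRUE values directly
```

length(small) gives 4 here (the columns), while nrow(small) and sum(cars$Type == "Small") both give 3. The sum() form avoids building the subset at all when only the count is needed.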
