R: Subsetting returns "0 obs."

I'm trying to subset my dataset 'eggdat' for daytime and nighttime hours. This:
'data.frame': 54847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Subsetting for daytime works fine:
daytime <- eggdat[eggdat$hour >= 7 & eggdat$hour <= 20, ]
'data.frame': 18847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Doing exactly the same thing for nighttime, however, returns a subset with 0 observations:
nighttime <- eggdat[eggdat$hour <= 7 & eggdat$hour >= 21, ]
'data.frame': 0 obs. of 10 variables:
$ year : int
$ month : int
$ day : int
$ hour : int
$ minute: int
$ second: int
$ Roll : num
$ Pitch : num
$ Yaw : num
$ temp1 : num
I really don't know what to do. I tried using subset, but without success. I also tried eggdat$hour <- as.factor(eggdat$hour), but couldn't get it to work either.
Even more confusingly, adding the quotation marks in the subset function (daytime <- eggdat[eggdat$hour >= '7' & eggdat$hour <= '20', ] and nighttime <- eggdat[eggdat$hour <= '7' & eggdat$hour >= '21', ]) resulted in the daytime subset containing '0 obs.', but the nighttime subset working fine, so it's just the other way around!
Daytime: 'data.frame': 0 obs. of 10 variables:
Nighttime:
'data.frame': 28800 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 21 21 21 21 21 21 21 21 21 21 ...
$ minute: int 0 0 0 0 0 0 0 0 0 0 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num 65.8 65.8 66.1 65.6 65.6 ...
$ Pitch : num 6.35 6.34 6.24 6.4 6.27 ...
$ Yaw : num 171 172 174 176 176 ...
$ temp1 : num 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 ...
I really don't know what to do; I'm very confused by all of this.

You want eggdat[eggdat$hour <= 7 | eggdat$hour >= 21, ]
x <= 7 & x >= 21 translates to x at most 7 AND at least 21, which no single value can satisfy, hence 0 obs.
x <= 7 | x >= 21 translates to x at most 7 OR at least 21
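A small sketch of the difference, using a made-up vector of hours (0 to 23) as a stand-in for eggdat$hour. It also suggests why the quoted versions behaved the other way round: comparing integers with '7' coerces the hours to character strings, which are then compared as text, so "20" sorts before "7". String ordering depends on the locale's collation, so treat the last two results as illustrative.

```r
hour <- 0:23  # hypothetical stand-in for eggdat$hour

# AND: no hour is both <= 7 and >= 21, so nothing is selected
n_and <- sum(hour <= 7 & hour >= 21)

# OR: the early-morning hours plus the late-evening hours
night <- hour[hour <= 7 | hour >= 21]

# Quoted bounds force a character comparison of "0", "1", ..., "23"
quoted_night <- hour[hour <= "7" & hour >= "21"]
quoted_day_n <- sum(hour >= "7" & hour <= "20")

print(n_and)
print(night)
print(quoted_night)
print(quoted_day_n)
```

With these bounds, hour 7 lands in both the daytime and the nighttime subset (daytime uses >= 7). If the two sets should not overlap, hour < 7 | hour >= 21 may be closer to what is intended.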

How to split a CHR column, pivot, then combine tables?

So I have two tables:
LST data (24 months in total) (already pivoted_longer)
Buffer Date LST
<chr> <chr> <chr>
1 100 15/01/2010 6.091741043
2 100 16/02/2010 6.405879111
3 100 20/03/2010 8.925945159
4 100 24/04/2011 6.278147269
5 100 07/05/2010 6.133940129
6 100 08/06/2010 7.705591939
7 100 13/07/2011 4.066052173
8 100 11/08/2010 5.962087092
9 100 12/09/2010 5.761892842
10 100 17/10/2011 3.155769317
# ... with 1,550 more rows
Weather data (24 months in total)
Weather variable 15/01/2010 16/02/2010 20/03/2010 24/04/2011 07/05/2010
1 Temperature 12.0 15.0 16.0 23.00 21.50
2 Wind_speed 10.0 9.0 10.5 19.50 9.50
3 Wind_trend 1.0 1.0 1.0 0.00 1.00
4 Wind_direction 22.5 45.0 67.5 191.25 56.25
5 Humidity 40.0 44.5 22.0 24.50 7.00
6 Pressure 1024.0 1018.5 1025.0 1005.50 1015.50
7 Pressure_trend 1.0 1.0 1.0 1.00 1.00
If I pivot the weather data I get:
1 Temperature 15/01/2010 12
2 Temperature 16/02/2010 15
3 Temperature 20/03/2010 16
4 Temperature 24/04/2011 23
5 Temperature 07/05/2010 21.5
6 Temperature 08/06/2010 36.5
7 Temperature 13/07/2011 33
8 Temperature 11/08/2010 34.5
9 Temperature 12/09/2010 33
10 Temperature 17/10/2011 27
# ... with 158 more rows
(each weather variable listed in turn).
I need to combine 1) and 3) - using the date and something like data_long <- merge(LST_data,weather_data,by="Date") I think - appending weather data columns to each row in 1).
But I'm stuck.
The solution I found to this was to pivot the weather data (longer):
weather_long <- weather %>% pivot_longer(cols = 2:21, names_to = "Date", values_to = "Value")
which gives a tibble in the format:
# A tibble: 180 x 3
`Weather variable` Date Value
<chr> <chr> <dbl>
1 Temperature 28/10/2016 17
2 Temperature 31/12/2016 22
3 Temperature 16/01/2017 25
4 Temperature 05/03/2017 19
(as described above in the question).
Because the 'Date' column is still a character vector after this step:
tibble [180 x 3] (S3: tbl_df/tbl/data.frame)
$ Weather variable: chr [1:180] "Temperature" "Temperature" "Temperature" "Temperature" ...
$ Date : chr [1:180] "28/10/2016" "31/12/2016" "16/01/2017" "05/03/2017" ...
$ Value : num [1:180] 17 22 25 19 20 22 11 10 3 9 ...
I then corrected this:
weather_long$Date <- as.Date(weather_long$Date, format = "%d/%m/%Y")
Next was to convert the weather data to the 'wide' format (in preparation for the next step):
weather_wide <- weather_long %>%
pivot_wider(names_from = "Weather variable", values_from = "Value")
Then join it to the LST data using the Date column as the key:
LST_Weather_dataset <- full_join(data_long, weather_wide, by = "Date")
This produced the desired result:
str(LST_Weather_dataset)
'data.frame': 380 obs. of 16 variables:
$ Buffer : int 100 200 300 400 500 600 700 800 900 1000 ...
$ Date : Date, format: "2016-10-28" "2016-10-28" "2016-10-28" "2016-10-28" ...
$ LST : num 0.918 0.951 0.791 0.748 0.687 ...
$ Month : num 10 10 10 10 10 10 10 10 10 10 ...
$ Year : num 2016 2016 2016 2016 2016 ...
$ JulianDay : num 302 302 302 302 302 302 302 302 302 302 ...
$ TimePeriod : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Temperature : num 17 17 17 17 17 17 17 17 17 17 ...
$ Humidity : num 59 59 59 59 59 59 59 59 59 59 ...
$ Humidity_trend: num 1 1 1 1 1 1 1 1 1 1 ...
$ Wind_speed : num 19 19 19 19 19 19 19 19 19 19 ...
$ Wind_gust : num 0 0 0 0 0 0 0 0 0 0 ...
$ Wind_trend : num 2 2 2 2 2 2 2 2 2 2 ...
$ Wind_direction: num 338 338 338 338 338 ...
$ Pressure : num 1017 1017 1017 1017 1017 ...
$ Pressure_trend: num 2 2 2 2 2 2 2 2 2 2 ...
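The pivot, convert, widen, join sequence described above can be sketched end to end on toy data. The two small tables below are invented for illustration (only the column layout follows the question), and the sketch assumes the dplyr and tidyr packages are installed. The key point is that both Date columns are parsed with as.Date() before joining, so the join key has the same type on each side.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the two tables (values are made up)
lst <- data.frame(
  Buffer = c(100, 200, 100, 200),
  Date   = c("15/01/2010", "15/01/2010", "16/02/2010", "16/02/2010"),
  LST    = c(6.09, 6.41, 8.93, 6.28)
)
weather <- data.frame(
  `Weather variable` = c("Temperature", "Humidity"),
  `15/01/2010` = c(12, 40),
  `16/02/2010` = c(15, 44.5),
  check.names = FALSE  # keep the date strings as column names
)

# Parse the LST dates so they match the type used on the weather side
lst$Date <- as.Date(lst$Date, format = "%d/%m/%Y")

weather_wide <- weather %>%
  # one row per (variable, date) pair
  pivot_longer(cols = -`Weather variable`, names_to = "Date", values_to = "Value") %>%
  # the former column names are character, so convert them to Date
  mutate(Date = as.Date(Date, format = "%d/%m/%Y")) %>%
  # one row per date, one column per weather variable
  pivot_wider(names_from = `Weather variable`, values_from = Value)

# Append the weather columns to every LST row with the same date
combined <- full_join(lst, weather_wide, by = "Date")
print(combined)
```

Selecting columns with -`Weather variable` instead of a fixed range such as 2:21 avoids having to update the call when the number of date columns changes.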

List being added to a dataframe

Why is a list being added to my dataframe here?
Here's my dataframe
df <- data.frame(ch = rep(1:10, each = 12), # care home id
year_id = rep(2018),
month_id = rep(1:12), # month using the system over the course of a year (1 = first month, 2 = second month...etc.)
totaladministrations = rbinom(n=120, size = 1000, prob = 0.6), # administrations that were scheduled to have been given in the month
missed = rbinom(n=120, size = 20, prob = 0.8), # administrations that weren't given in the month (these are bad!)
beds = rep(rbinom(n = 10, size = 60, prob = 0.6), each = 12), # number of beds in the care home
rating = rep(rbinom(n= 10, size = 4, prob = 0.5), each = 12)) # latest inspection rating (1. Inadequate, 2. Requires Improving, 3. Good, 4 Outstanding)
df <- arrange(df, df$ch, df$year_id, df$month_id)
str(df)
> str(df)
'data.frame': 120 obs. of 7 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations: int 576 598 608 576 608 637 611 613 593 626 ...
$ missed : int 18 18 19 16 16 13 17 16 15 17 ...
$ beds : int 38 38 38 38 38 38 38 38 38 38 ...
$ rating : int 2 2 2 2 2 2 2 2 2 2 ...
All good so far.
I just want to add another column that sequences the month number within the ch group (this equates to the actual month_id in this example but ignore that, my real life data is different), so I'm using:
df <- df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n())
This appears to add a bunch of stuff I don't really understand, want, or need, such as a list ...
str(df)
> str(df)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations : int 601 590 593 599 615 611 628 587 604 600 ...
$ missed : int 16 14 17 16 18 16 15 18 15 20 ...
$ beds : int 35 35 35 35 35 35 35 35 35 35 ...
$ rating : int 3 3 3 3 3 3 3 3 3 3 ...
$ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "groups")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 2 variables:
..$ ch : int 1 2 3 4 5 6 7 8 9 10
..$ .rows:List of 10
.. ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int 13 14 15 16 17 18 19 20 21 22 ...
.. ..$ : int 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int 37 38 39 40 41 42 43 44 45 46 ...
.. ..$ : int 49 50 51 52 53 54 55 56 57 58 ...
.. ..$ : int 61 62 63 64 65 66 67 68 69 70 ...
.. ..$ : int 73 74 75 76 77 78 79 80 81 82 ...
.. ..$ : int 85 86 87 88 89 90 91 92 93 94 ...
.. ..$ : int 97 98 99 100 101 102 103 104 105 106 ...
.. ..$ : int 109 110 111 112 113 114 115 116 117 118 ...
..- attr(*, ".drop")= logi TRUE
What's going on here? I just want a dataframe. Why is there all that additional output after $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ... and, more importantly, can I ignore it and just keep treating it as a normal dataframe (I'll be running some generalised linear mixed models on the df)?
The attribute "groups" is where dplyr stores the grouping information added when you did group_by(ch). It doesn't hurt anything, and it will disappear if you ungroup():
df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n()) %>%
ungroup %>%
str
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
# $ ch : int 1 1 1 1 1 1 1 1 1 1 ...
# $ year_id : num 2018 2018 2018 2018 2018 ...
# $ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
# $ totaladministrations : int 575 597 579 605 582 599 577 604 630 632 ...
# $ missed : int 18 16 16 18 18 11 10 13 17 16 ...
# $ beds : int 33 33 33 33 33 33 33 33 33 33 ...
# $ rating : int 3 3 3 3 3 3 3 3 3 3 ...
# $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
As a side-note, you should use bare column names inside dplyr verbs, not data$column. With arrange, it doesn't much matter, but in grouped operations it will cause bugs. You should get in the habit of using arrange(df, ch, year_id, month_id) instead of arrange(df, df$ch, df$year_id, df$month_id).
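A minimal sketch that combines both points, on a cut-down, made-up version of the data frame (three care homes, four months each), assuming dplyr is installed. It uses bare column names, drops the grouping with ungroup(), and, as an optional extra step not in the answer above, converts back to a plain data.frame with as.data.frame() in case a modelling function is picky about tibbles.

```r
library(dplyr)

# Hypothetical, smaller stand-in for the care home data
df <- data.frame(
  ch       = rep(1:3, each = 4),
  month_id = rep(4:1, times = 3)   # deliberately unsorted within each home
)

out <- df %>%
  arrange(ch, month_id) %>%                        # bare column names
  group_by(ch) %>%
  mutate(sequential_month_counter = row_number()) %>%
  ungroup() %>%                                    # removes the "groups" attribute
  as.data.frame()                                  # optional: back to a base data.frame

str(out)
```

row_number() is used here in place of 1:n(); the two give the same counter for non-empty groups.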

Retrieving corresponding column values based on row label [duplicate]

I have a data frame. Running str(data) on it gives the following:
> str(data)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
However, for example, when I want to subset the amounts of Ozone above 14 I use the following code which gives me an error:
> data[data$Ozone > 14 ]
Error in `[.data.frame`(data, data$Ozone > 14) : undefined columns selected
With no comma, the logical vector is treated as a column selector, hence the error. You want the rows where that condition is true, so you need a comma:
data[data$Ozone > 14, ]
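One thing to watch for with the comma version: Ozone contains NA values, and a logical index of NA produces a row of NAs in the result. The sketch below assumes the data frame in the question is R's built-in airquality dataset (the str() output matches it) and shows two common ways to drop those rows.

```r
data <- airquality  # assumption: same data as in the question

# Rows where Ozone is NA come back as all-NA rows
with_na <- data[data$Ozone > 14, ]

# Option 1: which() drops NA from the logical index
clean_which <- data[which(data$Ozone > 14), ]

# Option 2: test for NA explicitly
clean_isna <- data[!is.na(data$Ozone) & data$Ozone > 14, ]

nrow(with_na)
nrow(clean_which)
```

subset(data, Ozone > 14) also drops the NA rows, since subset() treats NA in the condition as FALSE.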

Not able to scrape a second table within a page using rvest

I'm able to scrape the first table of this page using the rvest package and using the following code:
library(rvest)
library(magrittr)
urlbbref <- read_html("http://www.baseball-reference.com/bio/Venezuela_born.shtml")
Bat <- urlbbref %>%
html_node(xpath = '//*[(@id = "bio_batting")]') %>%
html_table()
But I'm not able to scrape the second table of this page. I use selectorgadget to find the xpath of both tables and I use that info in the code, but it doesn't seem to be working for the second one.
Pit <- urlbbref %>%
html_node(xpath = '//*[(@id = "div_bio_pitching")]') %>%
html_table()
I come up with 3 tables in total.
library(magrittr)
library(rvest)
library(xml2)
library(stringi)
urlbbref <- read_html("http://www.baseball-reference.com/bio/Venezuela_born.shtml")
# First table is in the markup
table_one <- xml_find_all(urlbbref, "//table") %>% html_table
# Additional tables are within the comment tags, ie <!-- tables -->
# Which is why your xpath is missing them.
# First get the commented nodes
alt_tables <- xml2::xml_find_all(urlbbref,"//comment()") %>% {
#Find only commented nodes that contain the regex for html table markup
raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
# Remove the comment begin and end tags
strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--","-->"),c("",""),
vectorize_all = FALSE)
# Loop through the pieces that have tables within markup and
# apply the same functions
lapply(grep("<table", strip_html, value = TRUE), function(i){
rvest::html_table(xml_find_all(read_html(i), "//table")) %>%
.[[1]]
})
}
# Put all the data frames into a list.
all_tables <- c(
table_one, alt_tables
)
Results:
> Map(str, all_tables)
'data.frame': 361 obs. of 27 variables:
$ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Bobby Abreu" "Ehire Adrianza" "Jesus Aguilar" "Edgardo Alfonzo" ...
$ Yrs : int 18 4 4 12 6 7 1 5 5 2 ...
$ From : int 1996 2013 2014 1995 2006 2011 2000 2011 2013 2002 ...
$ To : int 2014 2016 2017 2006 2011 2017 2000 2015 2017 2004 ...
$ ASG : int 2 0 0 1 0 4 0 1 0 0 ...
$ G : int 2425 154 47 1506 193 842 2 92 150 38 ...
$ PA : int 10081 331 89 6108 624 3708 5 109 3 75 ...
$ AB : int 8480 291 81 5385 591 3411 5 94 2 64 ...
$ R : int 1453 27 4 777 44 456 1 5 0 11 ...
$ H : int 2470 64 18 1532 142 1062 1 22 0 16 ...
$ 2B : int 574 16 3 282 24 208 0 4 0 4 ...
$ 3B : int 59 1 0 18 3 19 0 0 0 0 ...
$ HR : int 288 3 0 146 17 60 0 1 0 2 ...
$ RBI : int 1363 26 8 744 67 326 0 9 0 10 ...
$ SB : int 400 4 0 53 1 204 0 0 0 1 ...
$ CS : int 128 4 0 17 2 59 0 0 0 0 ...
$ BB : int 1476 23 6 596 17 214 0 1 1 7 ...
$ SO : int 1840 60 28 617 158 389 1 34 0 12 ...
$ BA : num 0.291 0.22 0.222 0.284 0.24 0.311 0.2 0.234 0 0.25 ...
$ OBP : num 0.395 0.292 0.281 0.357 0.271 0.354 0.2 0.237 0.333 0.324 ...
$ SLG : num 0.475 0.313 0.259 0.425 0.377 0.436 0.2 0.309 0 0.406 ...
$ OPS : num 0.87 0.605 0.54 0.782 0.648 0.791 0.4 0.546 0.333 0.731 ...
$ Birthdate : chr "Mar 11, 1974" "Aug 21, 1989" "Jun 30, 1990" "Nov 8, 1973" ...
$ Debut : chr "Sep 1, 1996" "Sep 8, 2013" "May 15, 2014" "Apr 26, 1995" ...
$ Birthplace: chr "Maracay, Aragua" "Guarenas, Miranda" "Maracay, Aragua" "Santa Teresa del Tuy, Miranda" ...
$ Pos : chr "POS" "POS" "POS" "POS" ...
'data.frame': 157 obs. of 31 variables:
$ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Henderson Alvarez" "Jose Alvarez" "Wilson Alvarez" "Alexi Amarista" ...
$ Yrs : int 5 5 14 7 5 2 10 4 6 4 ...
$ From : int 2011 2013 1989 2011 1980 2015 1999 2007 2012 2005 ...
$ To : int 2015 2017 2005 2017 1984 2016 2008 2011 2017 2009 ...
$ ASG : int 1 0 1 0 0 0 0 0 0 0 ...
$ W : int 27 6 102 0 9 4 53 1 15 3 ...
$ L : int 34 12 92 0 6 2 65 3 6 4 ...
$ W-L% : num 0.443 0.333 0.526 NA 0.6 0.667 0.449 0.25 0.714 0.429 ...
$ ERA : num 3.8 3.97 3.96 0 3.27 4.35 4.65 5.28 2.91 6.86 ...
$ G : int 92 150 355 2 110 72 185 43 275 25 ...
$ GS : int 92 6 263 0 0 0 167 0 0 8 ...
$ GF : int 0 32 18 2 66 14 7 16 36 12 ...
$ CG : int 5 0 12 0 0 0 0 0 0 0 ...
$ SHO : int 5 0 5 0 0 0 0 0 0 0 ...
$ SV : int 0 0 4 0 7 0 0 0 0 0 ...
$ IP : num 563 167.2 1747.2 0.2 220 ...
$ H : int 596 174 1624 0 222 64 891 57 177 68 ...
$ R : int 261 85 857 0 86 39 519 29 75 51 ...
$ ER : int 238 74 769 0 80 30 478 27 72 46 ...
$ HR : int 54 17 190 0 17 5 122 7 10 4 ...
$ BB : int 129 55 805 0 68 36 431 21 80 34 ...
$ IBB : int 7 10 29 0 7 3 41 5 17 1 ...
$ SO : int 296 148 1330 0 113 63 680 41 180 37 ...
$ HBP : int 22 8 50 0 3 2 51 4 11 4 ...
$ BK : int 3 1 4 0 3 1 6 0 3 1 ...
$ WP : int 16 3 28 0 5 2 43 1 14 2 ...
$ BF : int 2358 729 7518 2 928 285 4055 221 913 282 ...
$ Birthdate : chr "Apr 18, 1990" "May 6, 1989" "Mar 24, 1970" "Apr 6, 1989" ...
$ Debut : chr "Aug 10, 2011" "Jun 9, 2013" "Jul 24, 1989" "Apr 26, 2011" ...
$ Birthplace: chr "Valencia, Carabobo" "Barcelona, Anzoategui" "Maracaibo, Zulia" "Barcelona, Anzoategui" ...
'data.frame': 3 obs. of 17 variables:
$ Rk : int 1 2 NA
$ Mgr : chr "Ozzie Guillen" "Al Pedrique" "Totals"
$ Yrs : int 9 1 10
$ From : int 2004 2004 2004
$ To : int 2012 2004 2012
$ W : int 747 22 769
$ L : int 710 61 771
$ W-L% : num 0.513 0.265 0.499
$ Ties : int 0 0 0
$ G>.500 : int 37 -39 -2
$ G : int 1457 83 1540
$ BestFin : int 1 5 1
$ WrstFin : int 5 5 5
$ AvRk : num 2.7 5 2.8
$ Birthdate : chr "Jan 20, 1964" "Aug 11, 1960" ""
$ Debut : chr "Apr 9, 1985" "Apr 14, 1987" ""
$ Birthplace: chr "Ocumare del Tuy, Miranda" "Valencia, Carabobo" ""
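The same idea can be tried offline on a tiny, made-up HTML string, which makes it easier to see why a plain table selector misses commented-out tables. This is a sketch only: the HTML below is invented, and it assumes rvest 1.0 or later (for html_element() and html_elements()) together with xml2.

```r
library(rvest)
library(xml2)

# Hypothetical page: one table in the markup, one inside an HTML comment
html <- '<html><body>
<table id="a"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<!-- <table id="b"><tr><th>y</th></tr><tr><td>2</td></tr></table> -->
</body></html>'

page <- read_html(html)

# Only the table in the live markup is found by a table selector
visible <- html_table(html_elements(page, "table"))

# Pull the text of every comment node, keep those containing table markup,
# then parse each one as its own small HTML document
comments <- xml_text(xml_find_all(page, "//comment()"))
hidden <- lapply(
  comments[grepl("<table", comments, fixed = TRUE)],
  function(txt) html_table(html_element(read_html(txt), "table"))
)

all_tables <- c(visible, hidden)
length(all_tables)
```

Because xml_text() returns the comment body without the <!-- and --> delimiters, no separate stripping step is needed in this variant.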
