Scraping web tables in R with interactive elements on page

I'm using rvest and tidyverse to scrape and process some data off the web.
There was recently a change to the website where some of the data is now in 2 tables and you can change between them using a button.
I'm trying to figure out how to scrape the data from both. They seem to have the same css class now so I can't figure out how to access each individually.
The code below seems to grab the "extended snowfall history", but I can't seem to figure out how to get the "2022-2023 winter season" data. Obviously I'll need to do a little processing and math to put the "2022-2023 winter season" into a new row in "extended snowfall history", but I can't even figure out how to grab it.
Currently I have:
library(rvest)
library(tidyverse)
mammoth <- read_html('https://www.mammothmountain.com/on-the-mountain/historical-snowfall')
snow <- mammoth %>%
html_element('table.css-86hwhl') %>%
html_table(header = TRUE, convert = TRUE) %>%
mutate_if(is.character,as.factor) %>%
mutate_if(is.integer,as.double) %>%
select(-Total)

A simple approach would be to use rvest::html_elements('table.css-86hwhl') (plural rather than singular), which extracts all elements matching the CSS selector 'table.css-86hwhl' (that is, every <table> with class css-86hwhl). Then you can manually choose the tables you want.
For example:
mammoth %>%
html_elements('table.css-86hwhl') %>%
html_table(header = TRUE, convert = TRUE)
gives a list of datasets
[[1]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 4 × 3
Date Inches `Season Total to Date`
<chr> <chr> <chr>
1 November 8 "15\"" "28\""
2 November 7 "2\"" "13\""
3 November 3 "5\"" "11\""
4 November 2 "6\"" "6\""
[[3]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
[[4]]
# A tibble: 4 × 3
Date Inches `Season Total to Date`
<chr> <chr> <chr>
1 November 8 "15\"" "28\""
2 November 7 "2\"" "13\""
3 November 3 "5\"" "11\""
4 November 2 "6\"" "6\""
[[5]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
You can then just extract [[1]] and [[2]], the tables you're looking for, and go from there. I'm sure there's a more principled approach out there, but this should do the job.
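As for the "processing and math" to fold the daily season table into a history-style row, a base-R sketch might look like this (the Date/Inches values are copied from the printed output above; the column names are taken from there too, so treat this as an assumption about the live page):

```r
# Daily "2022-2023 winter season" rows, as printed above
daily <- data.frame(
  Date   = c("November 8", "November 7", "November 3", "November 2"),
  Inches = c("15\"", "2\"", "5\"", "6\"")
)
daily$Inches <- as.numeric(gsub("\"", "", daily$Inches))  # strip the inch marks
daily$Month  <- sub(" .*", "", daily$Date)                # first word of Date
monthly <- tapply(daily$Inches, daily$Month, sum)         # snowfall per month
monthly[["November"]]  # 28, matching the "Season Total to Date" column
```

The per-month totals can then be slotted into the matching month columns of the history table.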

Related

Linear regression after select() method in R

I am trying to create a linear regression model from openintro::babies that predicts a baby's birthweight from all other variables in the data except case.
I have to following code:
library(tidyverse)
library(tidymodels)
babies <- openintro::babies %>%
drop_na() %>%
mutate(bwt = 28.3495 * bwt) %>%
mutate(weight = 0.453592 * weight)
linear_reg() %>%
set_engine("lm") %>%
fit(formula = bwt ~ ., data = babies %>% select(-case)) %>%
pluck("fit") %>%
augment(babies)
but in my output, I obtain the case variable as well
# A tibble: 1,174 x 14
case bwt gestation parity age height weight smoke .fitted .resid .hat .sigma .cooksd .std.resid
<int> <dbl> <int> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3402. 284 0 27 62 45.4 0 3459. -56.8 0.00374 449. 0.00000863 -0.127
2 2 3203. 282 0 33 64 61.2 0 3547. -344. 0.00227 449. 0.000191 -0.767
3 3 3629. 279 0 28 64 52.2 1 3244. 385. 0.00291 449. 0.000307 0.858
4 5 3062. 282 0 23 67 56.7 1 3396. -334. 0.00475 449. 0.000379 -0.746
5 6 3856. 286 0 25 62 42.2 0 3474. 381. 0.00495 449. 0.000515 0.851
6 7 3912. 244 0 33 62 80.7 0 3065. 848. 0.0137 448. 0.00715 1.90
7 8 3742. 245 0 23 65 63.5 0 3124. 618. 0.00716 449. 0.00197 1.38
8 9 3402. 289 0 25 62 56.7 0 3558. -156. 0.00301 449. 0.0000521 -0.348
9 10 4054. 299 0 30 66 61.7 1 3591. 463. 0.00462 449. 0.000710 1.03
10 11 3969. 351 0 27 68 54.4 0 4527. -558. 0.0221 449. 0.00510 -1.26
# ... with 1,164 more rows
I'm not sure whether this is the correct way, or whether it is inherent to the output.
Your code is correct. You're getting the case column because of the augment(babies) call, but if you replace it with augment(babies %>% select(-case)) you won't get that column. In other words, the regression model you're fitting does not take the case column into account.
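The same behaviour can be seen in a base-R sketch (using mtcars instead of the babies data, and fitted() standing in for augment()): the columns that come back are the columns of whatever data frame you hand over, not the columns the model was fit on.

```r
dat <- mtcars
dat$case <- seq_len(nrow(dat))  # an ID column we don't want in the model
# fit without case, analogous to fit(bwt ~ ., data = babies %>% select(-case))
fit <- lm(mpg ~ ., data = subset(dat, select = -case))
# binding fitted values onto dat keeps case; onto the case-free data it does not
aug_with    <- cbind(dat, .fitted = fitted(fit))
aug_without <- cbind(subset(dat, select = -case), .fitted = fitted(fit))
c("case" %in% names(aug_with), "case" %in% names(aug_without))  # TRUE FALSE
```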

Use cases with higher value on one variable for each case of another variable in R

I am doing a meta-analysis in R. For each study (variable StudyID) I have multiple effect sizes. For some studies I have the same effect size multiple times depending on the level of acquaintance (variable Familiarity) between the subjects.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
1 1 3.0 5.0 1 0.0462 4 0 44 1
2 1 5.0 2.5 1 0.1335 4 0 44 1
3 1 2.5 3.0 1 -0.1239 4 0 44 1
4 1 2.5 3.5 1 0.2062 4 0 44 1
5 1 2.5 3.0 1 -0.0370 4 0 44 1
6 1 3.0 5.0 1 -0.3850 4 0 44 1
Those are the first rows of the data set. In total there are over 50 studies. Most studies look like study 1 with the same value in "Familiarity" for all effect sizes. In some studies, there are effect sizes with multiple levels of familiarity. For example study 36 as seen below.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
142 36 1.0 4.5 0 0.1233 5.00 0 311 1
143 36 3.5 3.0 0 0.0428 5.00 0 311 1
144 36 1.0 4.5 0 0.0986 5.00 0 311 1
145 36 1.0 4.5 1 -0.0520 5.00 0 311 1
146 36 1.5 2.5 1 -0.0258 5.00 0 311 1
147 36 3.5 3.0 1 0.1104 5.00 0 311 1
148 36 1.0 4.5 1 0.0282 5.00 0 311 1
149 36 1.0 4.5 2 -0.1724 5.00 0 311 1
150 36 3.5 3.0 2 0.2646 5.00 0 311 1
151 36 1.0 4.5 2 -0.1426 5.00 0 311 1
152 37 3.0 4.0 1 0.0118 5.35 0 123 0
153 37 1.0 4.5 1 -0.3205 5.35 0 123 0
154 37 2.5 3.0 1 -0.2356 5.35 0 123 0
155 37 3.0 2.0 1 0.1372 5.35 0 123 0
156 37 2.5 2.5 1 -0.1401 5.35 0 123 0
157 37 3.0 3.5 1 -0.3334 5.35 0 123 0
158 37 2.5 2.5 1 0.0317 5.35 0 123 0
159 37 1.0 3.0 1 -0.3025 5.35 0 123 0
160 37 1.0 3.5 1 -0.3248 5.35 0 123 0
Now, for those studies that include multiple levels of familiarity, I want to take the rows with only one level of familiarity (two separate versions: one with the lower, one with the higher familiarity).
I think that it can be possible with the package dplyr, but I have no real code so far.
In a second step I would like to give those rows unique studyIDs for each level of familiarity (so create out of study 36 three "different" studies).
Thank you in advance!
If you want to use dplyr, you could create an alternate ID or casenum by numbering the (studyID, Familiarity) groups. In current dplyr, cur_group_id() replaces the now-deprecated group_indices(.dots = ...):
df <- df %>%
group_by(studyID, Familiarity) %>%
mutate(case_num = cur_group_id()) %>%
ungroup()
You could do:
library(dplyr)
df %>%
group_by(studyID) %>%
mutate(nDist = n_distinct(Familiarity) > 1) %>%
ungroup() %>%
mutate(
studyID = case_when(nDist ~ paste(studyID, Familiarity, sep = "_"), TRUE ~ studyID %>% as.character),
nDist = NULL
)
Output:
# A tibble: 19 x 9
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
<chr> <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 36_0 1 4.5 0 0.123 5 0 311 1
2 36_0 3.5 3 0 0.0428 5 0 311 1
3 36_0 1 4.5 0 0.0986 5 0 311 1
4 36_1 1 4.5 1 -0.052 5 0 311 1
5 36_1 1.5 2.5 1 -0.0258 5 0 311 1
6 36_1 3.5 3 1 0.110 5 0 311 1
7 36_1 1 4.5 1 0.0282 5 0 311 1
8 36_2 1 4.5 2 -0.172 5 0 311 1
9 36_2 3.5 3 2 0.265 5 0 311 1
10 36_2 1 4.5 2 -0.143 5 0 311 1
11 37 3 4 1 0.0118 5.35 0 123 0
12 37 1 4.5 1 -0.320 5.35 0 123 0
13 37 2.5 3 1 -0.236 5.35 0 123 0
14 37 3 2 1 0.137 5.35 0 123 0
15 37 2.5 2.5 1 -0.140 5.35 0 123 0
16 37 3 3.5 1 -0.333 5.35 0 123 0
17 37 2.5 2.5 1 0.0317 5.35 0 123 0
18 37 1 3 1 -0.302 5.35 0 123 0
19 37 1 3.5 1 -0.325 5.35 0 123 0
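The same relabelling can also be done in base R, in case you'd rather avoid the dplyr dependency. A sketch on a tiny made-up data frame (the values are illustrative, not your real data):

```r
df <- data.frame(
  studyID     = c(36, 36, 36, 37, 37),
  Familiarity = c(0, 1, 1, 1, 1)
)
# flag studies that have more than one familiarity level
multi <- ave(df$Familiarity, df$studyID,
             FUN = function(x) length(unique(x)) > 1) == 1
# append the familiarity level only for those studies
df$studyID <- ifelse(multi,
                     paste(df$studyID, df$Familiarity, sep = "_"),
                     as.character(df$studyID))
df$studyID  # "36_0" "36_1" "36_1" "37" "37"
```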

Setting column names to be value of nth row for list of data frames

I'm trying to bind the rows of the following list (I cut it short because of character limits, but the list has ~15 elements).
library(tidyverse)
library(rvest)
library(magrittr)
url <- "http://stats.swehockey.se/Teams/Info/PlayersByTeam/9301"
url %>%
read_html() %>%
html_nodes('[class = "tblContent"]') %>%
html_table() %>%
magrittr::extract(c(TRUE, FALSE))
#> [[1]]
#> A I K  IF A I K  IF A I K  IF
#> 1 Playing Statistics Playing Statistics Playing Statistics
#> 2 Rk No Name
#> 3 1 10 Gynge, Richard
#> 4 2 21 Sandberg, Christian
#> 5 3 6 Tärnström, Dick
#> 6 4 18 Bergström, Patrik
#> 7 5 20 Bång, Daniel
#> 8 6 92 Westerling, Jonas
#> 9 7 68 Ahlström, Victor
#> 10 8 22 Beck, Mattias
#> 11 9 79 Eriksson, Henrik
#> 12 10 86 Ahlström, Oscar
#> 13 11 16 Ryno, Johan
#> 14 12 23 Liwing, Jonas
#> 15 13 19 Ericsson, Tobias
#> 16 14 37 Johansson, Stefan
#> 17 15 45 Savilahti-Nagander, Per
#> 18 16 33 Engblom, David
#> 19 17 4 Österberg, Mikael
#> 20 18 3 Jungbeck, Andreas
#> 21 19 7 Olsson, Filip
#> 22 20 17 Carlsson, Fredrik
#> 23 21 89 Lawson, Lucas
#> 24 22 15 Korduner, Fredric
#> 25 23 72 Dahlström, Andreas
#> 26 24 12 Nemeth, Patrik
#> 27 25 89 Franzén, Mathias
#> 28 26 40 Heino-Lindberg, Christopher
#> 29 27 12 Bergman, Alexander
#> 30 89 Gozzi, Patric
#> 31 72 Lundberg, Martin
#> 32 30 15 Lehmann, Niclas
#> 33 31 72 Nordström, Joakim
#> 34 89 Ramsten, Joakim
#> 35 33 35 Lundström, Niklas
#> 36 34 12 Nilsson, Henrik
#> 37 35 35 Strandberg, Joakim
#> 38 36 35 Bjurö, Jonatan
#> 39 37 31 Holmgren, Fredrik
#> A I K  IF A I K  IF A I K  IF
#> 1 Playing Statistics Playing Statistics Playing Statistics
#> 2 Pos GP G
#> 3 LW 51 28
#> 4 CE 51 20
#> 5 LD 42 9
#> 6 RW 51 15
#> 7 RW 48 14
#> 8 LW 46 7
#> 9 LW 51 14
#> 10 LW 51 8
#> 11 CE 47 7
#> 12 RW 52 6
#> 13 RW 47 9
#> 14 RD 52 8
#> 15 LW 50 7
#> 16 LD 43 4
#> 17 RD 49 3
#> 18 CE 39 5
#> 19 RD 50 3
#> 20 LD 45 1
#> 21 LD 33 0
#> 22 LD 45 0
#> 23 CE 11 3
#> 24 RW 13 2
#> 25 CE 6 1
#> 26 LD 16 0
#> 27 LD 5 0
#> 28 GK 41 0
#> 29 LD 1 0
#> 30 LD 1 0
#> 31 LD 1 0
#> 32 RW 1 0
#> 33 CE 2 0
#> 34 LD 2 0
#> 35 GK 4 0
#> 36 RD 4 0
#> 37 GK 5 0
#> 38 GK 19 0
#> 39 GK 35 0
#> A I K  IF A I K  IF A I K  IF
#> 1 Playing Statistics Playing Statistics Playing Statistics
#> 2 A TP PIM
#> 3 24 52 16
#> 4 27 47 67
#> 5 23 32 74
#> 6 16 31 57
#> 7 17 31 86
#> 8 23 30 18
#> 9 14 28 16
#> 10 18 26 22
#> 11 15 22 52
#> 12 15 21 8
#> 13 11 20 34
#> 14 12 20 36
#> 15 9 16 12
#> 16 11 15 26
#> 17 9 12 90
#> 18 5 10 22
#> 19 7 10 69
#> 20 7 8 52
#> 21 5 5 2
#> 22 5 5 49
#> 23 0 3 2
#> 24 1 3 2
#> 25 2 3 4
#> 26 3 3 8
#> 27 1 1 12
#> 28 1 1 4
#> 29 0 0 0
#> 30 0 0 0
#> 31 0 0 0
#> 32 0 0 2
#> 33 0 0 0
#> 34 0 0 0
#> 35 0 0 0
#> 36 0 0 0
#> 37 0 0 0
#> 38 0 0 0
#> 39 0 0 0
#> A I K  IF A I K  IF A I K  IF
#> 1 Playing Statistics Playing Statistics Playing Statistics
#> 2 + - +/-
#> 3 42 22 20
#> 4 34 24 10
#> 5 33 24 9
#> 6 35 24 11
#> 7 31 27 4
#> 8 31 16 15
#> 9 29 22 7
#> 10 22 23 -1
#> 11 27 25 2
#> 12 27 22 5
#> 13 28 21 7
#> 14 43 28 15
#> 15 21 15 6
#> 16 30 19 11
#> 17 36 26 10
#> 18 19 11 8
#> 19 23 25 -2
#> 20 28 20 8
#> 21 14 8 6
#> 22 29 16 13
#> 23 6 1 5
#> 24 4 3 1
#> 25 4 2 2
#> 26 11 4 7
#> 27 1 2 -1
#> 28
#> 29 0 0 0
#> 30 0 0 0
#> 31 0 0 0
#> 32 0 0 0
#> 33 0 0 0
#> 34 0 0 0
#> 35
#> 36 0 0 0
#> 37
#> 38
#> 39
#> A I K  IF A I K  IF A I K  IF
#> 1 Playing Statistics Playing Statistics Playing Statistics
#> 2 GWG PPG SHG
#> 3 5 5 0
#> 4 4 8 1
#> 5 2 7 0
#> 6 2 3 0
#> 7 2 8 1
#> 8 2 1 1
#> 9 4 3 0
#> 10 1 4 1
#> 11 0 0 0
#> 12 0 3 0
#> 13 0 0 0
#> 14 2 3 0
#> 15 1 0 1
#> 16 1 1 0
#> 17 1 0 0
#> 18 4 0 0
#> 19 1 0 0
#> 20 1 1 0
#> 21 0 0 0
#> 22 0 0 0
#> 23 0 0 0
#> 24 0 0 0
#> 25 1 0 0
#> 26 0 0 0
#> 27 0 0 0
#> 28 0 0 0
#> 29 0 0 0
#> 30 0 0 0
#> 31 0 0 0
#> 32 0 0 0
#> 33 0 0 0
#> 34 0 0 0
#> 35 0 0 0
#> 36 0 0 0
#> 37 0 0 0
#> 38 0 0 0
#> 39 0 0 0
#> A I K  IF A I K  IF A I K  IF
#> 1 Playing Statistics Playing Statistics Playing Statistics
#> 2 SOG SG% FO+
#> 3 131 21.37 140
#> 4 157 12.74 360
#> 5 102 8.82 0
#> 6 88 17.05 6
#> 7 91 15.38 2
#> 8 72 9.72 171
#> 9 90 15.56 18
#> 10 102 7.84 3
#> 11 91 7.69 435
#> 12 92 6.52 1
#> 13 127 7.09 234
#> 14 91 8.79 1
#> 15 79 8.86 18
#> 16 95 4.21 0
#> 17 86 3.49 0
#> 18 35 14.29 245
#> 19 33 9.09 0
#> 20 66 1.52 0
#> 21 17 0.00 0
#> 22 17 0.00 0
#> 23 19 15.79 46
#> 24 13 15.38 0
#> 25 4 25.00 9
#> 26 8 0.00 0
#> 27 3 0.00 0
#> 28 0 N/A
#> 29 0 N/A 0
#> 30 0 N/A 0
#> 31 0 N/A 0
#> 32 1 0.00 0
#> 33 1 0.00 5
#> 34 0 N/A 0
#> 35 0 N/A
#> 36 0 N/A 0
#> 37 0 N/A
#> 38 0 N/A
#> 39 1 0.00
#> A I K  IF A I K  IF [Top]
#> 1 Playing Statistics Playing Statistics Playing Statistics
#> 2 FO- FO FO%
#> 3 127 267 52.43
#> 4 270 630 57.14
#> 5 0 0 N/A
#> 6 10 16 37.50
#> 7 5 7 28.57
#> 8 192 363 47.11
#> 9 29 47 38.30
#> 10 5 8 37.50
#> 11 335 770 56.49
#> 12 3 4 25.00
#> 13 250 484 48.35
#> 14 9 10 10.00
#> 15 20 38 47.37
#> 16 0 0 N/A
#> 17 0 0 N/A
#> 18 253 498 49.20
#> 19 1 1 0.00
#> 20 0 0 N/A
#> 21 3 3 0.00
#> 22 0 0 N/A
#> 23 40 86 53.49
#> 24 0 0 N/A
#> 25 19 28 32.14
#> 26 0 0 N/A
#> 27 1 1 0.00
#> 28
#> 29 0 0 N/A
#> 30 0 0 N/A
#> 31 0 0 N/A
#> 32 0 0 N/A
#> 33 8 13 38.46
#> 34 0 0 N/A
#> 35
#> 36 0 0 N/A
#> 37
#> 38
However, dplyr::bind_rows() doesn't work. I assume it's because the column names are not the same. The variable names of interest are in row 2.
How could I set the column names for each variable to be the value of the 2nd row for each element of the list?
Something I've tried is
url %>%
read_html() %>%
html_nodes('[class = "tblContent"]') %>%
html_table() %>%
magrittr::extract(c(TRUE, FALSE)) %>%
map(set_names, nm = slice(., 2))
But this doesn't work. Any ideas?
A solution using the tidyverse package. Assuming the list of data frames you downloaded is called dat_list, dat is the final combined data frame. The key is to use the information in row 2 to rename each data frame and then use map_dfr to combine the individual data frames.
library(tidyverse)
library(rvest)
library(magrittr)
url <- "http://stats.swehockey.se/Teams/Info/PlayersByTeam/9301"
dat_list <- url %>%
read_html() %>%
html_nodes('[class = "tblContent"]') %>%
html_table() %>%
magrittr::extract(c(TRUE, FALSE))
dat <- dat_list %>%
map_dfr(function(x){
name_vec <- as.vector(x[2, ])
temp <- x %>%
setNames(name_vec) %>%
slice(3:n())
return(temp)
})
Here's a short function that grabs the second row for the names and removes the first 2 rows.
convertData = function(df) {
newnames = unlist(df[2, ], use.names = F)
df = df[3:nrow(df), ]
names(df) = newnames
return(df)
}
So you can lapply() it to all the data.frames then combine them into 1 data.frame.
list_ = url %>%
read_html() %>%
html_nodes('[class = "tblContent"]') %>%
html_table() %>%
magrittr::extract(c(TRUE, FALSE)) %>%
lapply(convertData) %>%
bind_rows()
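To see the header-promotion step in isolation, here is a self-contained mock (made-up values standing in for one scraped table, with row 2 holding the real column names, as in the output above):

```r
mock <- data.frame(
  V1 = c("Playing Statistics", "Rk", "1", "2"),
  V2 = c("Playing Statistics", "Name", "Gynge, Richard", "Sandberg, Christian")
)
convertData <- function(df) {
  newnames <- unlist(df[2, ], use.names = FALSE)  # row 2 -> column names
  df <- df[3:nrow(df), ]                          # drop the two header rows
  names(df) <- newnames
  df
}
out <- convertData(mock)
names(out)  # "Rk" "Name"
```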

Removing data with a non-numeric column value in R

So I have a dataset that includes the lung capacity of certain individuals. I am trying to analyze the data distributions and relations. The only problem is that the data is somewhat incomplete. Some of the rows include "N/A" as the lung capacity. This is causing an issue because it is resulting in a mean and sd of always "N/A" for the different subsets. How would I form this into a subset so that it only includes the data that isn't N/A?
I've tried this:
fData1 = read.table("lung.txt",header=TRUE)
fData2= fData1[fData1$fev!="N/A"]
but this gives me an "undefined columns selected" error.
How can I make it so that I have a data set that excludes the rows with "N/A"?
Here is the beginning of my data set:
id age fev height male smoke
1 72 1.2840 66.5 1 1
2 81 2.5530 67.0 0 0
3 90 2.3830 67.0 1 0
4 72 2.6990 71.5 1 0
5 70 2.0310 62.5 0 0
6 72 2.4100 67.5 1 0
7 75 3.5860 69.0 1 0
8 75 2.9580 67.0 1 0
9 67 1.9160 62.5 0 0
10 70 NA 66.0 0 1
The "undefined columns selected" error comes from the missing comma: fData1[fData1$fev != "N/A", ] (with the comma) subsets rows, while without it R tries to select columns. Beyond that, one option is to read the "N/A" strings in as real NA values and exclude them from the computations:
dat <- read.table("lung.txt", header = TRUE, na.strings = c("NA", "N/A"))
mean(dat$fev, na.rm = TRUE) # mean of fev col
sd(dat$fev, na.rm = TRUE)
If you simply want to get rid of the NAs:
fData1 <- na.omit(fData1)
fData1 <- na.exclude(fData1) # same result
If you'd like to pull out just the rows with NA's, here are 2 options:
fData2 <- fData1[is.na(fData1$fev), ]
fData2 <- subset(fData1, is.na(fData1$fev))
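On a toy version of the data, the two directions (dropping vs. keeping the NA rows) look like this, assuming fev was read in with real NA values:

```r
fData1 <- data.frame(id = 1:3, fev = c(1.284, NA, 2.553))
complete <- fData1[!is.na(fData1$fev), ]  # rows where fev is present
na_rows  <- fData1[is.na(fData1$fev), ]   # rows where fev is missing
nrow(complete)      # 2
mean(complete$fev)  # no NA in the result now
```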
If you just want to filter out rows with NA values, you can use complete.cases():
> df
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0
10 10 70 NA 66.0 0 1
> df[complete.cases(df), ]
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0

How to retain row and column format of a text file (.cel) in R

I read a text file into R (shown below) with 1354896 rows and 5 columns.
I tried read.table() and read.delim() to load the file, but after loading the format changes: everything is transformed into a single column.
OffsetY=0
GridCornerUL=258 182
GridCornerUR=8450 210
GridCornerLR=8419 8443
GridCornerLL=228 8414
Axis-invertX=0
AxisInvertY=0
swapXY=0
DatHeader=[19..65528] PA-D 102 Full:CLS=8652 RWS=8652 XIN=1 YIN=1 VE=30 2.0 11/04/03 12:49:30 50205710 M10 HG-U133_Plus_2.1sq 6
Algorithm=Percentile
AlgorithmParameters=Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004;AlgVersion:6.0;FixedCellSize:TRUE;FullFeatureWidth:7;FullFeatureHeight:7;IgnoreOutliersInShiftRows:FALSE;FeatureExtraction:TRUE;PoolWidthExtenstion:2;PoolHeightExtension:2;UseSubgrids:FALSE;RandomizePixels:FALSE;ErrorBasis:StdvMean;StdMult:1.000000
[INTENSITY]
NumberCells=1354896
CellHeader=X Y MEAN STDV NPIXELS
0 0 147.0 23.5 25
1 0 10015.0 1276.7 25
2 0 160.0 24.7 25
3 0 9710.0 1159.8 25
4 0 85.0 14.0 25
5 0 171.0 21.0 25
6 0 11648.0 1678.4 25
7 0 163.0 30.7 25
8 0 12044.0 1430.1 25
9 0 169.0 25.7 25
10 0 11646.0 1925.6 25
11 0 176.0 30.7 25
After reading, the format is changed: everything ends up in a single column.
I want to retain the format of rows and columns, and I want to remove all the content before [INTENSITY] (OffsetY, GridCornerUL, and so on) shown in the first file.
You could try:
txt <- readLines("file.txt")
df <- read.csv(text = txt[-(1:grep("NumberCells=\\d+", txt))], check.names = FALSE)
write.csv(df, tf <- tempfile(fileext = ".csv"), row.names = FALSE)
read.csv(tf, check.names = FALSE) # just to verify...
# CellHeader=X Y MEAN STDV NPIXELS
# 1 0 0 147.0 23.5 25
# 2 1 0 10015.0 1276.7 25
# 3 2 0 160.0 24.7 25
# 4 3 0 9710.0 1159.8 25
# 5 4 0 85.0 14.0 25
# 6 5 0 171.0 21.0 25
# 7 6 0 11648.0 1678.4 25
# 8 7 0 163.0 30.7 25
# 9 8 0 12044.0 1430.1 25
# 10 9 0 169.0 25.7 25
# 11 10 0 11646.0 1925.6 25
# 12 11 0 176.0 30.7 25
This omits everything before and including NumberCells=1354896.
As you are using Linux, another option would be to pipe awk output into read.table or fread:
read.table(pipe("awk 'NR==1, /NumberCells/ {next}{print}' Hashim.txt"),
header=TRUE, check.names=FALSE)
# CellHeader=X Y MEAN STDV NPIXELS
#1 0 0 147 23.5 25
#2 1 0 10015 1276.7 25
#3 2 0 160 24.7 25
#4 3 0 9710 1159.8 25
#5 4 0 85 14.0 25
#6 5 0 171 21.0 25
#7 6 0 11648 1678.4 25
#8 7 0 163 30.7 25
#9 8 0 12044 1430.1 25
#10 9 0 169 25.7 25
#11 10 0 11646 1925.6 25
#12 11 0 176 30.7 25
If NumberCells= always appears immediately before the header row, then you can exploit this to tell you the number of lines to skip:
dat<-readLines("file.txt")
read.table(textConnection(dat), header=TRUE, skip=grep("NumberCells", dat))
# CellHeader.X Y MEAN STDV NPIXELS
#1 0 0 147 23.5 25
#2 1 0 10015 1276.7 25
#3 2 0 160 24.7 25
#4 3 0 9710 1159.8 25
#5 4 0 85 14.0 25
#6 5 0 171 21.0 25
#7 6 0 11648 1678.4 25
#8 7 0 163 30.7 25
#9 8 0 12044 1430.1 25
#10 9 0 169 25.7 25
#11 10 0 11646 1925.6 25
#12 11 0 176 30.7 25
Edit
Because your files have a lot of rows, you may want to limit the number of lines that readLines reads in. To do this, you need to know the maximum number of lines before your header row. For instance, if you know your header row will always come within the first 200 lines of the file, you can do:
dat<-readLines("file.txt", n=200)
read.table("file.txt", header=TRUE, skip=grep("NumberCells", dat))
